When making action classification predictions on video frames, are there ways or considerations for improving speed aside from skipping frames (cadence)? Is prediction window size a factor?
The window size is fixed once a model is trained. At prediction time, you may adjust something called stride, i.e. how often (in number of frames) you create a new prediction window. The smaller the stride, the sooner the next prediction is made, so the results refresh faster.
There is sample code from last year, called Detecting Human Actions in a Live Video Feed, that you can check as a reference.
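To make the window/stride relationship concrete, here is a minimal, framework-agnostic sketch in Python. The `classify` function is a hypothetical stand-in for the trained model call, and the numbers are illustrative, not recommendations:

```python
from collections import deque

WINDOW_SIZE = 60   # frames per prediction window (fixed at training time)
STRIDE = 15        # frames between consecutive predictions (tunable at inference)

def classify(window):
    # Hypothetical stand-in for the trained action classifier.
    return f"prediction over frames {window[0]}..{window[-1]}"

def run_inference(frames):
    """Slide a fixed-size window over the frame stream every STRIDE frames."""
    buffer = deque(maxlen=WINDOW_SIZE)
    results = []
    for i, frame in enumerate(frames):
        buffer.append(frame)
        # Predict once the buffer is full, then every STRIDE frames after that.
        if len(buffer) == WINDOW_SIZE and (i - WINDOW_SIZE + 1) % STRIDE == 0:
            results.append(classify(list(buffer)))
    return results
```

With 120 frames, the settings above produce a prediction at frame 59 and then every 15 frames after that; halving `STRIDE` would double how often the results refresh, at the cost of twice as many model calls.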
Your app - and the model - need to balance three concerns: accuracy, latency, and CPU usage.
- A shorter window size (selected at training time) may reduce run-time latency. But if the window is shorter than the longest event you are trying to recognise, accuracy may suffer.
- A shorter stride can also reduce latency, but it results in more frequent calls to the model, which can increase energy use.
- A longer stride reduces CPU usage and gives other parts of your app, such as drawing code, more time to run - which might help avoid performance-related bugs - but it increases latency.
Also, try not to skip frames (unless you did the same for your training data); otherwise the effective action speed captured in the window will change, which can hurt prediction accuracy.
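To see why skipping frames changes the effective action speed, consider the arithmetic. This is a sketch with illustrative numbers, not values from any particular device:

```python
def window_duration_seconds(window_size_frames, camera_fps, keep_every_nth=1):
    """Real-world time span covered by one prediction window.

    keep_every_nth=1 means no frames are skipped; keep_every_nth=2 means
    every other frame is dropped, so each kept frame spans more real time.
    """
    effective_fps = camera_fps / keep_every_nth
    return window_size_frames / effective_fps

# A 60-frame window at 30 fps covers 2 seconds of motion...
assert window_duration_seconds(60, 30) == 2.0
# ...but with every other frame skipped it covers 4 seconds, so motion
# inside the window appears twice as fast as in the training data.
assert window_duration_seconds(60, 30, keep_every_nth=2) == 4.0
```

If the training data was not decimated the same way, the model sees actions at a speed it was never trained on.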
Latency: the time between when the action happens in the real world and when the app detects it and tells the user. This is experienced as 'snappiness'. For action classification, latency is at least roughly as long as the window size, plus a bit. It's an emergent quantity determined by the other choices you make.
Stride: the time the model waits between two detections. This can be shorter OR longer than the window length. It's a parameter used in the inference pipeline in your app.
Latency can be slightly shorter than the window size, because the model can predict the action once it has seen a partial action. That's also why a shorter stride helps the results refresh faster.
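As a rough back-of-envelope model - an assumption for intuition, not an exact formula - worst-case latency is bounded by how much of the window the model must see before it can classify, plus up to one stride of waiting for the next prediction to fire:

```python
def worst_case_latency_s(window_s, stride_s, partial_fraction=1.0):
    """Rough upper bound on detection latency, in seconds.

    window_s: window length in seconds (fixed at training time)
    stride_s: time between consecutive predictions
    partial_fraction: how much of the action the model needs to see
        before it can classify it (1.0 = the full action; < 1.0 models
        the 'partial action' effect, where latency can dip below the
        window size).
    """
    # Wait for enough of the action to fill the window, plus at most
    # one full stride before the next prediction runs.
    return window_s * partial_fraction + stride_s
```

For example, a 2-second window with a 0.5-second stride gives a worst case around 2.5 seconds, dropping toward 2.0 seconds if the model can classify from three quarters of the action.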
You may also be able to reduce energy, memory, and CPU usage by setting the camera to a lower resolution. If your app is very busy, this change might improve performance. It should not reduce model accuracy, but your user's view of the scene will not be as sharp.
Another inference speed factor is your device. A newer device with a Neural Engine will help. A fully charged device and good lighting conditions can also reduce latency (by improving Vision pose-detection speed and camera frame rate).
We have a great session on understanding the performance of apps with machine learning models this year! Check it out: