Abstract: We present a novel method to analyze the earliest instant of time at which a pretrained video action recognition neural network can predict the action class with high confidence. This problem bears similarities to pricing options in a European stock market; consequently, our approach, Finterp, is inspired by the Black Scholes model in finance. We formulate analogies between the variables in the Black Scholes formula and video frames to derive the appropriate algorithm. We use Finterp to extensively analyze the prediction capabilities of the neural network over time on multiple diverse datasets. Finterp reveals that optimal frames are concentrated at early time instants for datasets with scene bias and at mid time instants for datasets with motion bias. We demonstrate that Finterp does not compromise the confidence of action prediction in an attempt to minimize the length of video observed. The 'Black Scholes Accuracy' for state-of-the-art 3D CNNs such as I3D and X3D stands at $81-86\%$, $64\%$, and $25\%$ for Kinetics, UAV Human, and Diving-48, respectively, revealing the need to develop neural networks that can learn unique temporal signatures for various actions. Finally, we extend Finterp to make optimal time instant predictions at the hierarchical level, where similar action classes are grouped together, and show that these predictions occur at earlier time instants than the corresponding predictions without hierarchy. We will make all code publicly available.
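For readers unfamiliar with the finance side, the following is a minimal sketch of the textbook Black Scholes price of a European call option, which is the formula the abstract refers to. It is not the Finterp algorithm: the mapping from these financial variables to video frames is defined in the paper and is not reproduced here. The function and parameter names (`black_scholes_call`, `spot`, `strike`, `rate`, `volatility`, `maturity`) are illustrative assumptions.

```python
import math


def norm_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def black_scholes_call(spot: float, strike: float, rate: float,
                       volatility: float, maturity: float) -> float:
    """Textbook Black Scholes price of a European call option.

    spot:       current price of the underlying asset
    strike:     exercise price of the option
    rate:       continuously compounded risk-free rate (per year)
    volatility: annualized volatility of the underlying
    maturity:   time to expiry in years
    """
    if min(spot, strike, volatility, maturity) <= 0:
        raise ValueError("spot, strike, volatility and maturity must be positive")
    sigma_sqrt_t = volatility * math.sqrt(maturity)
    d1 = (math.log(spot / strike)
          + (rate + 0.5 * volatility ** 2) * maturity) / sigma_sqrt_t
    d2 = d1 - sigma_sqrt_t
    # Discounted expected payoff under the risk-neutral measure.
    return spot * norm_cdf(d1) - strike * math.exp(-rate * maturity) * norm_cdf(d2)


if __name__ == "__main__":
    # Hypothetical inputs chosen only for illustration.
    print(round(black_scholes_call(100.0, 100.0, 0.05, 0.2, 1.0), 4))
```

The price depends on only five quantities, which is what makes it natural to look for analogues of each one in another domain.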
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We've addressed the reviewers' comments in this updated PDF. To make the changes easy to track, we've used a blue font for the portions of the paper we've revised; we'll change the font color to black after acceptance. Here's a summary of the changes:
1. Clarifications on assumptions - pages 15, 16 (Reviewers T7Dt, ddgM, agm1)
2. Comparisons with transformer architectures - page 20 (Reviewer ddgM)
3. Comparisons with prior art - page 20 (Reviewer T7Dt)
4. Limitation - analysis only on trimmed videos - page 12 (Reviewer agm1)
5. Clarification on the cost function - page 4 (Reviewers T7Dt, agm1)
6. Clarification on risk free rate - page 5 (Reviewer T7Dt)
7. Final formula for Black Scholes - page 5 (Reviewer T7Dt)
8. Optimal frame definition - page 5 (Reviewer agm1)
9. Brief background on European call options - page 3 (Reviewer agm1 on background on financial terms)
10. Renaming gradient score to gradient norm - throughout the paper (Reviewer agm1)
11. Early action recognition differences - page 2 (Reviewer T7Dt)
Assigned Action Editor: ~David_Fouhey2
Submission Number: 1284