- TL;DR: Efficient video classification using frame-based conditional gating module for selecting most-dominant frames, followed by temporal modeling and classifier.
- Abstract: CNNs are widely successful in recognizing human actions in videos, albeit with a great cost of computation. This cost is significantly higher in the case of long-range actions, where a video can span up to a few minutes, on average. The goal of this paper is to reduce the computational cost of these CNNs, without sacrificing their performance. We propose VideoEpitoma, a neural network architecture comprising two modules: a timestamp selector and a video classifier. Given a long-range video of thousands of timesteps, the selector learns to choose only a few but most representative timesteps for the video. This selector resides on top of a lightweight CNN such as MobileNet and uses a novel gating module to take a binary decision: consider or discard a video timestep. This decision is conditioned on both the timestep-level feature and the video-level consensus. A heavyweight CNN model such as I3D takes the selected frames as input and performs video classification. Using off-the-shelf video classifiers, VideoEpitoma reduces the computation by up to 50\% without compromising the accuracy. In addition, we show that if trained end-to-end, the selector learns to make better choices to the benefit of the classifier, despite the selector and the classifier residing on two different CNNs. Finally, we report state-of-the-art results on two datasets for long-range action recognition: Charades and Breakfast Actions, with much-reduced computation. In particular, we match the accuracy of I3D by using less than half of the computation.
- Keywords: Computer Vision, Action Recognition, Video Understanding, Efficient CNNs