- Keywords: Lightweight action recognition, compressed videos, temporal trilinear pooling, knowledge distillation
- TL;DR: The first end-to-end lightweight solution for compressed video action recognition.
- Abstract: Most existing action recognition models are large convolutional neural networks (CNNs) that work only with raw RGB frames as input. However, practical applications require lightweight models that directly process compressed videos. In this work, for the first time, such a model is developed; it is lightweight enough to run in real time on embedded AI devices (e.g., 40 FPS on a Jetson TX2) without sacrificing recognition accuracy. Compared to existing compressed video action recognition models, it is much more compact and faster thanks to its lightweight CNN backbone. Further, a number of novel components are introduced to improve the effectiveness of the model: (1) A new Aligned Temporal Trilinear Pooling (ATTP) module is formulated to fuse the three modalities in a compressed video, namely I-frames, motion vectors, and residuals. (2) Since motion vectors are weaker than optical flow (computed from raw RGB streams) at representing dynamic content, we introduce a temporal fusion method that explicitly incorporates temporal context, together with knowledge distillation, via feature alignment, from a model trained with optical flow. Importantly, in contrast to existing models that either ignore B-frames or use them incorrectly, our ATTP model handles B-frames correctly, at the cost of more complicated modeling, and is thus compatible with a wider range of contemporary codecs. Extensive experiments show that ATTP outperforms state-of-the-art alternatives in both efficiency and accuracy.
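The two key ideas in the abstract, fusing the three compressed-domain modalities and distilling from an optical-flow teacher through feature alignment, can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual ATTP module: the projection matrices stand in for learned weights, the element-wise (Hadamard) fusion with signed square-root and L2 normalization is one common low-rank realization of trilinear pooling, and `feature_alignment_loss` is a hypothetical name for a simple L2 alignment objective.

```python
import numpy as np

def trilinear_pool(f_i, f_mv, f_res, W_i, W_mv, W_res):
    """Sketch of trilinear pooling over three modality features:
    I-frame (f_i), motion vector (f_mv), and residual (f_res).
    W_* are projection matrices into a shared embedding space
    (learned in a real model; arbitrary here)."""
    # Project each modality, then fuse by element-wise product.
    z = (W_i @ f_i) * (W_mv @ f_mv) * (W_res @ f_res)
    # Signed square-root followed by L2 normalization, a standard
    # post-processing step for bilinear/trilinear pooled features.
    z = np.sign(z) * np.sqrt(np.abs(z))
    return z / (np.linalg.norm(z) + 1e-8)

def feature_alignment_loss(student_feat, teacher_feat):
    """Distillation by feature alignment: penalize the distance between
    the student's motion-vector features and the optical-flow teacher's
    features (mean squared error as a simple choice)."""
    return float(np.mean((student_feat - teacher_feat) ** 2))

# Toy usage with random features and projections.
rng = np.random.default_rng(0)
f_i, f_mv, f_res = rng.standard_normal(64), rng.standard_normal(32), rng.standard_normal(32)
W_i = rng.standard_normal((128, 64))
W_mv = rng.standard_normal((128, 32))
W_res = rng.standard_normal((128, 32))

fused = trilinear_pool(f_i, f_mv, f_res, W_i, W_mv, W_res)  # unit-norm fused descriptor
teacher = rng.standard_normal(128)
loss = feature_alignment_loss(fused, teacher)
```

In a real pipeline the fused descriptor would feed a classifier, and the alignment loss would be added to the classification loss during training so the motion-vector branch mimics the stronger optical-flow representation.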