Abstract: Video action recognition must reconcile a heavy computational burden with demanding performance requirements. Operating on compressed-domain data, which avoids much of the decoding computation, is one possible solution. However, existing compressed-domain (CD) methods fall short of the performance of state-of-the-art (SOTA) raw-domain (RD) methods. To close this gap, we propose a cross-modality knowledge distillation method that forces the CD model to learn from the RD model. Specifically, spatial knowledge and temporal knowledge are first constructed to align the feature spaces of the raw domain and the compressed domain. Then, an adaptive multi-path knowledge learning scheme is presented to help the CD model learn more efficiently. Experiments on both large-scale and small-scale datasets verify the effectiveness of the proposed method.
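To make the distillation idea concrete, the following is a minimal sketch of a generic teacher-student loss in which the RD model acts as teacher and the CD model as student. It is not the formulation used in this work, which the abstract does not specify. The function names, the temperature value, the use of mean squared error for feature alignment, and the softmax-based path weighting are all assumptions chosen for illustration.

```python
import math


def softmax(values, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def feature_mse(teacher_feat, student_feat):
    """Mean squared error between two feature vectors of equal length."""
    diffs = [(t - s) ** 2 for t, s in zip(teacher_feat, student_feat)]
    return sum(diffs) / len(diffs)


def path_weights(path_losses):
    """Hypothetical weighting: paths with larger loss get larger weight.

    This is only a stand-in for an adaptive scheme; it is an assumption,
    not the weighting rule of the proposed method.
    """
    return softmax(path_losses)


def distillation_loss(teacher_logits, student_logits,
                      teacher_feats, student_feats, temperature=4.0):
    """Softened-logit KL term plus a weighted sum of per-path feature losses.

    teacher_feats / student_feats are lists of feature vectors, one per
    path (for example one spatial path and one temporal path).
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    logit_term = kl_divergence(p_teacher, p_student) * temperature ** 2

    losses = [feature_mse(t, s) for t, s in zip(teacher_feats, student_feats)]
    weights = path_weights(losses)
    feature_term = sum(w * l for w, l in zip(weights, losses))
    return logit_term + feature_term
```

In this sketch the loss is zero when the student reproduces the teacher's logits and features exactly, and it grows as the two models diverge. In practice the feature vectors would come from intermediate layers of the two networks, and the loss would be added to the usual classification loss.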