Abstract: Recently, self-supervised video representation learning based on Masked Video Modeling (MVM) has demonstrated promising results for action recognition. However, existing methods face two significant challenges: (1) video actions involve a crucial temporal dimension, yet current masking strategies rely on inefficient random sampling, which dilutes the sparse dynamic motion cues in videos; (2) pre-training requires large-scale datasets and substantial computing resources (including large batch sizes and very long training schedules). To address these issues, we propose a novel method named Data-Efficient Masked Video Modeling (DEMVM) for self-supervised action recognition. Specifically, a novel masking strategy named Flow-Guided Dense Masking (FGDM) is proposed to facilitate efficient learning by focusing on action-related temporal cues: it applies dense masking to dynamic regions, identified via optical flow priors, and sparse masking to background regions. Furthermore, DEMVM introduces a 3D video tokenizer to enhance the modeling of temporal cues. Finally, Progressive Masking Ratio (PMR) and 2D initialization strategies are presented to help the model adapt to the characteristics of the MVM paradigm at different training stages. Extensive experiments on multiple benchmarks, UCF101, HMDB51, and Mimetics, demonstrate that our method achieves state-of-the-art performance on the downstream action recognition task with high data efficiency and low computational cost. Notably, few-shot experiments on the Mimetics dataset show that DEMVM recognizes actions accurately even in the presence of context bias.
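To make the FGDM idea concrete, below is a minimal sketch of flow-guided masking, assuming per-pixel optical flow magnitudes are available from any off-the-shelf estimator (e.g., RAFT). All function names, ratios, and the quantile threshold are illustrative assumptions, not the paper's exact implementation; the point is simply that dynamic (high-flow) patches receive a higher masking probability than background patches.

```python
import torch
import torch.nn.functional as F

def flow_guided_dense_mask(flow_mag, dense_ratio=0.9, sparse_ratio=0.5,
                           patch_size=16, motion_quantile=0.7):
    """Illustrative sketch of flow-guided dense masking (hypothetical
    parameters; not the authors' released code).

    flow_mag: (T, H, W) tensor of optical flow magnitudes, used as a
              motion prior for each frame.
    Returns a boolean mask of shape (T, H // patch_size, W // patch_size),
    where True marks tokens to be masked before the encoder.
    """
    # Pool flow magnitude to patch granularity: one motion score per token.
    patch_motion = F.avg_pool2d(
        flow_mag.unsqueeze(1), kernel_size=patch_size).squeeze(1)  # (T, h, w)

    # Split tokens into dynamic vs. background via a per-clip motion quantile
    # (an assumed heuristic for locating action-related regions).
    thresh = torch.quantile(patch_motion.flatten(), motion_quantile)
    dynamic = patch_motion >= thresh

    # Dense masking on dynamic regions, sparse masking on the background.
    rand = torch.rand_like(patch_motion)
    return torch.where(dynamic, rand < dense_ratio, rand < sparse_ratio)

# Usage: mask a 16-frame 224x224 clip given its flow magnitudes.
flow_mag = torch.rand(16, 224, 224)  # stand-in for real flow magnitudes
mask = flow_guided_dense_mask(flow_mag)
print(mask.shape, mask.float().mean())  # (16, 14, 14), overall mask ratio
```

The design intuition, as the abstract describes it, is that random masking wastes model capacity on easily reconstructed static background, whereas biasing the mask toward high-motion tokens forces the model to learn the temporal cues that matter for action recognition.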