Abstract: Recently, self-supervised video representation learning based on Masked Video Modeling (MVM) has demonstrated promising results for action recognition. However, existing methods face two significant challenges: (1) video actions involve a crucial temporal dimension, yet current masking strategies rely on inefficient random sampling, which dilutes the sparse dynamic motion cues in videos; (2) pre-training requires large-scale datasets and substantial computing resources (including large batch sizes and very long training schedules). To address these issues, we propose a novel method named Data-Efficient Masked Video Modeling (DEMVM) for self-supervised action recognition. Specifically, a novel masking strategy named Flow-Guided Dense Masking (FGDM) is proposed to facilitate efficient learning by focusing on action-related temporal cues: it applies dense masking to dynamic regions, identified via optical flow priors, and sparse masking to background regions. Furthermore, DEMVM introduces a 3D video tokenizer to enhance the modeling of temporal cues. Finally, Progressive Masking Ratio (PMR) and 2D initialization strategies are presented to help the model adapt to the characteristics of the MVM paradigm at different training stages. Extensive experiments on multiple benchmarks, UCF101, HMDB51, and Mimetics, demonstrate that our method achieves state-of-the-art performance on the downstream action recognition task with high data efficiency and low computational cost. Notably, few-shot experiments on the Mimetics dataset show that DEMVM recognizes actions accurately even in the presence of context bias.
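To make the FGDM idea concrete, below is a minimal sketch of flow-guided masking, assuming per-pixel optical flow magnitudes are available from any off-the-shelf estimator (e.g., RAFT). All function names, ratios, and the quantile threshold are illustrative assumptions, not the paper's exact implementation; the point is simply that dynamic (high-flow) patches receive a higher masking probability than background patches.

```python
import torch
import torch.nn.functional as F

def flow_guided_dense_mask(flow_mag, dense_ratio=0.9, sparse_ratio=0.5,
                           patch_size=16, motion_quantile=0.7):
    """Illustrative sketch of flow-guided dense masking (hypothetical
    parameters; not the authors' released code).

    flow_mag: (T, H, W) tensor of optical flow magnitudes, used as a
              motion prior for each frame.
    Returns a boolean mask of shape (T, H // patch_size, W // patch_size),
    where True marks tokens to be masked before the encoder.
    """
    # Pool flow magnitude to patch granularity: one motion score per token.
    patch_motion = F.avg_pool2d(
        flow_mag.unsqueeze(1), kernel_size=patch_size).squeeze(1)  # (T, h, w)

    # Split tokens into dynamic vs. background via a per-clip motion quantile
    # (an assumed heuristic for locating action-related regions).
    thresh = torch.quantile(patch_motion.flatten(), motion_quantile)
    dynamic = patch_motion >= thresh

    # Dense masking on dynamic regions, sparse masking on the background.
    rand = torch.rand_like(patch_motion)
    return torch.where(dynamic, rand < dense_ratio, rand < sparse_ratio)

# Usage: mask a 16-frame 224x224 clip given its flow magnitudes.
flow_mag = torch.rand(16, 224, 224)  # stand-in for real flow magnitudes
mask = flow_guided_dense_mask(flow_mag)
print(mask.shape, mask.float().mean())  # (16, 14, 14), overall mask ratio
```

The design intuition, as the abstract describes it, is that random masking wastes model capacity on easily reconstructed static background, whereas biasing the mask toward high-motion tokens forces the model to learn the temporal cues that matter for action recognition.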