Abstract: Fine-grained action recognition is a challenging task that requires identifying subtle, discriminative motion variations among fine-grained action classes. Existing methods typically focus on spatio-temporal feature extraction and long-range temporal modeling to characterize the complex spatio-temporal patterns of fine-grained actions. However, spatio-temporal features learned without explicit motion modeling may emphasize visual appearance over motion, which can compromise the learning of the motion features required for fine-grained temporal reasoning. Two issues are therefore crucial for fine-grained action recognition yet remain underexplored: how to decouple robust motion representations from the spatio-temporal features, and how to leverage them effectively to enhance the learning of discriminative features. In this paper, we propose a motion representation decoupling and concentration network (MDCNet) to address these two issues. First, we devise a motion representation decoupling (MRD) module that disentangles the spatio-temporal representation into appearance and motion features through contrastive learning from video and segment views. Next, in the proposed motion representation concentration (MRC) module, the decoupled motion representations are used to learn a universal motion prototype shared across all instances of each action class. Finally, we project the decoupled motion features onto all the motion prototypes through semantic relations to obtain concentrated action-relevant features for each action class, which effectively characterize the temporal distinctions of fine-grained actions and improve recognition performance. Comprehensive experimental results on four widely used action recognition benchmarks, i.e., FineGym, Diving48, Kinetics400 and Something-Something, demonstrate the superiority of the proposed method over state-of-the-art approaches.
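To make the prototype projection step concrete, the following is a minimal, illustrative sketch and not the MDCNet implementation. It rests on assumptions that the abstract does not state: each class prototype is taken as the mean of that class's motion features (the paper learns its prototypes), the "semantic relation" is modeled as a temperature-scaled softmax over cosine similarities, and the concentrated feature is the relation-weighted sum of prototypes. All function names and the `temperature` parameter are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def class_prototypes(motion_feats, labels):
    """ASSUMPTION: a prototype is the mean motion feature of its class."""
    sums, counts = {}, {}
    for feat, label in zip(motion_feats, labels):
        if label not in sums:
            sums[label] = [0.0] * len(feat)
            counts[label] = 0
        sums[label] = [s + f for s, f in zip(sums[label], feat)]
        counts[label] += 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def project_onto_prototypes(motion_feat, prototypes, temperature=0.1):
    """ASSUMPTION: relation = softmax(cosine similarity / temperature).

    Returns the per-class relation weights and the relation-weighted
    sum of prototypes (the "concentrated" feature in this sketch).
    """
    classes = sorted(prototypes)
    sims = [cosine(motion_feat, prototypes[c]) for c in classes]
    weights = softmax(sims, temperature)
    dim = len(motion_feat)
    concentrated = [
        sum(w * prototypes[c][d] for w, c in zip(weights, classes))
        for d in range(dim)
    ]
    return dict(zip(classes, weights)), concentrated


# Toy usage with two hypothetical classes and 2-D motion features.
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = ["a", "a", "b", "b"]
protos = class_prototypes(feats, labels)
weights, concentrated = project_onto_prototypes([1.0, 0.0], protos)
```

In this toy setup, a query feature close to class "a" receives most of the relation weight for that class, so the concentrated feature is pulled toward the class "a" prototype. A lower `temperature` sharpens that assignment.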