Motion-Aware Mask Feature Reconstruction for Skeleton-Based Action Recognition

Published: 01 Jan 2024, Last Modified: 10 Mar 2025. IEEE Trans. Circuits Syst. Video Technol., 2024. License: CC BY-SA 4.0
Abstract: Despite recent advances in masked skeleton modeling and visual-language pre-training, no existing method captures and exploits the rich semantic information embedded in both modalities for enhanced action recognition. To address this gap, we propose a novel Motion-Aware Mask Feature Reconstruction (MMFR) method for the challenging task of skeleton-based action recognition. MMFR integrates masked skeleton feature reconstruction with a visual-language pre-trained model in a unified framework, aiming to exploit the synergy between the two domains. Specifically, it employs the visual-language model to infuse semantic understanding into the skeleton feature reconstruction process via probability distribution distillation. Moreover, we introduce a multi-granularity semantic contrast module that refines vision-text alignment and enriches the contextual information available for accurate mask reconstruction. Extensive experiments demonstrate MMFR's superiority in skeleton-based action recognition, as well as its efficacy in zero-shot scenarios.
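To make the two named mechanisms concrete, the sketch below illustrates generic forms of (i) probability distribution distillation from a visual-language teacher to the skeleton branch and (ii) a single level of vision-text contrastive alignment. This is a minimal illustration of the standard techniques the abstract refers to, not the paper's implementation: the function names, the temperature values, and the use of temperature-scaled KL divergence and symmetric InfoNCE are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F


def distribution_distillation_loss(skeleton_logits: torch.Tensor,
                                   vl_logits: torch.Tensor,
                                   temperature: float = 4.0) -> torch.Tensor:
    """Temperature-scaled KL distillation from a (frozen) visual-language
    teacher to the masked skeleton reconstruction branch (illustrative)."""
    # Teacher probabilities from the visual-language model.
    teacher_probs = F.softmax(vl_logits / temperature, dim=-1)
    # Student log-probabilities from the skeleton branch.
    student_log_probs = F.log_softmax(skeleton_logits / temperature, dim=-1)
    # Standard distillation objective, rescaled by T^2.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2


def semantic_contrast_loss(skeleton_feats: torch.Tensor,
                           text_feats: torch.Tensor,
                           temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between L2-normalized skeleton and text embeddings;
    a multi-granularity module would apply this at several feature levels."""
    s = F.normalize(skeleton_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    logits = s @ t.t() / temperature                    # (B, B) similarities
    targets = torch.arange(s.size(0), device=s.device)  # matched pairs on diagonal
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```

In a training loop, the two terms would simply be summed (with weighting hyperparameters) alongside the masked reconstruction loss; the exact weighting and granularity levels are specific to the paper and not reproduced here.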