Joint 3-D Human Reconstruction and Hybrid Pose Self-Supervision for Action Recognition

Published: 01 Jan 2025, Last Modified: 12 Apr 2025 · IEEE Internet Things J. 2025 · CC BY-SA 4.0
Abstract: Action recognition aims to identify human activities in videos or images. Human movement is often accompanied by occlusion and blurring, so many similar behaviors cannot be correctly recognized. Most methods directly employ local or motion cues to enhance global features; because 3-D depth information is insufficiently explored, actions that appear similar from the 2-D perspective still cannot be reliably distinguished. To tackle this challenge, this article proposes a novel action recognition approach, RPS-Net, which integrates 3-D human body reconstruction and hybrid pose self-supervision (HPSS). The designed 3-D reconstruction network, CFFormer, leverages a context-fusion structure to generate 3-D meshes as auxiliary input. Among these, the mesh rotated 90° (M9) provides multiperspective associative information, and spatial residual learning enhances the extra 3-D cues under various interpose variations. Meanwhile, the original images and the meshes from the same perspective undergo HPSS with temporal encoding, which captures subtle differences in key-point positions and structures across multiple dimensions. Extensive experiments demonstrate that the proposed method significantly surpasses most existing algorithms, achieving Top-1 recognition accuracies of 73.4%, 87.7%, and 96.2% on the SSV2, HMDB51, and Olympic Sports datasets, respectively.
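The M9 idea in the abstract, re-rendering the reconstructed mesh from a viewpoint rotated 90°, can be illustrated with a minimal sketch. The function name, the choice of the vertical (y) axis, and the NumPy representation are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def rotate_mesh_y(vertices, degrees=90.0):
    """Rotate (N, 3) mesh vertices about the vertical (y) axis.

    Sketch of the 90-degree mesh rotation (M9): the reconstructed
    3-D mesh is viewed from a rotated perspective to expose depth
    cues hidden in the original 2-D view. Axis and API are assumed.
    """
    theta = np.deg2rad(degrees)
    c, s = np.cos(theta), np.sin(theta)
    rot_y = np.array([[  c, 0.0,   s],
                      [0.0, 1.0, 0.0],
                      [ -s, 0.0,   c]])
    return vertices @ rot_y.T

# A vertex on the +x axis maps (approximately) to the -z axis
# after a 90-degree yaw, i.e. depth becomes visible width.
v = np.array([[1.0, 0.0, 0.0]])
print(np.round(rotate_mesh_y(v), 6))
```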