Crossformer3D: cross spatio-temporal transformer for 3D human pose estimation

Published: 2025 · Last Modified: 06 Nov 2025 · Signal Image Video Process. 2025 · CC BY-SA 4.0
Abstract: 3D human pose estimation can be handled by encoding the geometric dependencies between body parts and enforcing kinematic constraints. Recently, transformers have been adopted to better encode the long-range dependencies between joints across both the spatial and temporal domains. However, previous studies have highlighted the need to improve the locality of vision Transformers. To this end, we propose a novel pose estimation Transformer featuring rich representations of body joints that are critical for capturing subtle changes across frames (i.e., inter-feature representations). More specifically, through two novel interaction modules, Cross-Joint Interaction and Cross-Frame Interaction, the model explicitly encodes the local and global dependencies between body joints. The proposed architecture achieves state-of-the-art performance on two popular 3D human pose estimation datasets, Human3.6M and MPI-INF-3DHP. In particular, our proposed CrossFormer3D method boosts performance by 0.9% and 3% over its closest counterpart, PoseFormer, under the detected 2D pose and ground-truth 2D pose settings, respectively.
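The spatial-then-temporal attention pattern the abstract describes can be illustrated with a minimal sketch. This is not the paper's actual Cross-Joint/Cross-Frame modules (which are not specified here); it is a toy NumPy version, assuming plain scaled dot-product attention applied first across joints within each frame, then across frames for each joint. All names and shapes are illustrative assumptions.

```python
import numpy as np

def attention(q, k, v):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V,
    # batched over the leading axis.
    d = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Toy input: T frames, J joints, C channels per joint embedding
# (17 joints matches the Human3.6M skeleton; other values are arbitrary).
rng = np.random.default_rng(0)
T, J, C = 4, 17, 8
x = rng.standard_normal((T, J, C))

# "Cross-joint" pass (hypothetical): attend across the J joints within each frame.
spatial = attention(x, x, x)                          # shape (T, J, C)

# "Cross-frame" pass (hypothetical): attend across the T frames for each joint.
xt = spatial.transpose(1, 0, 2)                       # (J, T, C)
temporal = attention(xt, xt, xt).transpose(1, 0, 2)   # back to (T, J, C)
```

The two passes factorize full spatio-temporal attention: joint-to-joint attention captures per-frame (local) structure, while frame-to-frame attention links the same joint across time, which is the kind of locality/long-range split the abstract motivates.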