Hierarchical Local Temporal Feature Enhancing for Transformer-Based 3D Human Pose Estimation

Xin Yan, Chi-Man Pun, Haolun Li, Mengqi Liu, Hao Gao

Published: 2024, Last Modified: 10 Apr 2025. Venue: ICME 2024. License: CC BY-SA 4.0

Abstract: Transformer-based methods have recently achieved substantial success in 2D-to-3D human pose estimation, owing in part to inherent advantages such as a global receptive field. Nevertheless, existing transformer approaches ignore the differences among local contexts and therefore learn local information insufficiently. To address this issue, we introduce non-uniform graph convolution to extract spatial local relationships in skeletons, remedying the limited ability of traditional transformers to learn human body topology. Additionally, our proposed Hierarchical Local Temporal Network (HLTN) models local temporal associations at three hierarchical levels (joints, body parts, and poses), addressing the limited ability of traditional transformers to learn localized human movements. We connect these two modules in parallel with the spatial and temporal transformers to obtain better features of skeleton sequences. Compared with the latest methods, our method achieves state-of-the-art performance on multiple datasets.
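
To make the three hierarchical levels named in the abstract concrete, the following is a minimal NumPy sketch of how a skeleton sequence might be viewed at the joint, body-part, and pose level, with a local temporal window applied at each. It is not the authors' HLTN: the 17-joint layout, the `BODY_PARTS` grouping, the function names, and the use of a plain windowed mean in place of learned layers are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical grouping of a 17-joint skeleton into five body parts.
# The joint indices are an assumption, not taken from the paper.
BODY_PARTS = {
    "torso": [0, 7, 8, 9, 10],
    "right_leg": [1, 2, 3],
    "left_leg": [4, 5, 6],
    "left_arm": [11, 12, 13],
    "right_arm": [14, 15, 16],
}


def local_temporal_mean(x, window=3):
    """Average each frame with its neighbours in a centred window (axis 0).

    A plain mean stands in for whatever learned local temporal operator
    a real model would use. `window` is assumed to be odd.
    """
    pad = window // 2
    pad_spec = [(pad, pad)] + [(0, 0)] * (x.ndim - 1)
    padded = np.pad(x, pad_spec, mode="edge")  # repeat first/last frame
    return np.stack([padded[t:t + window].mean(axis=0)
                     for t in range(x.shape[0])])


def hierarchical_local_features(seq, window=3):
    """seq has shape (frames, joints, channels).

    Returns locally averaged features at three levels:
    joints (T, J, C), body parts (T, P, C), and whole poses (T, J*C).
    """
    joints = seq
    # Pool the joints of each body part into one feature vector per part.
    parts = np.stack([seq[:, idx].mean(axis=1)
                      for idx in BODY_PARTS.values()], axis=1)
    # Flatten each frame into a single pose vector.
    poses = seq.reshape(seq.shape[0], -1)
    return {
        "joints": local_temporal_mean(joints, window),
        "parts": local_temporal_mean(parts, window),
        "poses": local_temporal_mean(poses, window),
    }
```

In this sketch the three outputs differ only in how joints are grouped before the temporal window is applied; a trained model would presumably replace both the pooling and the windowed mean with learned operators.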