Abstract: Recent advances in transformer-based methods have brought substantial success to 2D-to-3D human pose estimation. Transformer-based estimators have inherent advantages, such as a global receptive field. However, existing transformer approaches ignore the differences among local contexts, which leads to insufficient learning of local information. To address this issue, we introduce nonuniform graph convolution to extract local spatial relationships in skeletons, remedying the difficulty traditional transformers have in learning human body topology. In addition, our proposed hierarchical local temporal network (HLTN) models local temporal associations at three hierarchical levels: 1) joints; 2) body parts; and 3) poses, addressing the difficulty traditional transformers have in learning localized human movements. We connect these two modules in parallel with the spatial and temporal transformers to obtain better features of skeleton sequences. Furthermore, we integrate nonuniform graph convolution with the spatial transformer to enable interaction between local and global features at the attention level. With these improvements, our network not only identifies global trends effectively but is also more sensitive to local variations. Compared with the latest methods, our method achieves state-of-the-art performance on the Human3.6M and MPI-INF-3DHP datasets.