Abstract: Recent transformer-based methods have achieved substantial success in 2D-to-3D human pose estimation. Transformer-based estimators offer inherent advantages such as a global receptive field. However, existing transformer approaches ignore the differences among local contexts and therefore learn local information insufficiently. To address this issue, we introduce non-uniform graph convolution to extract local spatial relationships in skeletons, remedying the limited ability of traditional transformers to learn human body topology. In addition, our proposed Hierarchical Local Temporal Network (HLTN) models local temporal associations at three hierarchical levels: joints, body parts, and poses. This addresses the limited ability of traditional transformers to learn localized human movements. We connect these two modules in parallel with the spatial and temporal transformer to obtain better features of skeleton sequences. Compared with the latest methods, our method achieves state-of-the-art performance on multiple datasets.