Abstract: Transformer-based architectures have achieved strong results on sequence-to-sequence and vision tasks, including 3D human pose estimation. However, transformer-based 3D human pose estimation methods are weaker than RNNs and CNNs at capturing local information, which plays a major role in recovering 3D positional relationships. In this paper, we propose a method that combines local human body parts with global skeleton joints using a temporal transformer to finely track the temporal motion of human body parts. We first encode positional and temporal information, then apply a local-to-global temporal transformer to extract local and global features, and finally regress the target 3D human pose. To evaluate the effectiveness of our method, we conducted quantitative and qualitative evaluations on two popular, standard benchmark datasets: Human3.6M and HumanEva-I. Extensive experiments demonstrate that our method achieves state-of-the-art performance on Human3.6M with 2D ground truth as input.
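The temporal transformer described above aggregates information across a window of frames via self-attention. The following is a minimal, illustrative NumPy sketch of one temporal self-attention step over per-frame pose embeddings; the window size, embedding dimension, and weight names are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_self_attention(x, Wq, Wk, Wv):
    # x: (T, D) sequence of per-frame pose embeddings
    # (e.g., embedded 2D joint coordinates for T frames)
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])  # (T, T) frame-to-frame attention
    return softmax(scores) @ v               # temporally aggregated features

rng = np.random.default_rng(0)
T, D = 9, 32                         # hypothetical 9-frame window, 32-dim embedding
x = rng.normal(size=(T, D))          # stand-in for an embedded 2D joint sequence
Wq, Wk, Wv = (rng.normal(size=(D, D)) for _ in range(3))
out = temporal_self_attention(x, Wq, Wk, Wv)
print(out.shape)  # (9, 32): one aggregated feature vector per frame
```

In a full model of this kind, such attention layers would be stacked (here, over local body-part tokens and global skeleton tokens) and followed by a regression head producing the 3D pose.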