ST2PE: Spatial and Temporal Transformer for Pose Estimation
Abstract: Spatial-temporal information and multi-view mutual promotion are two keys to resolving depth uncertainty in 2D-to-3D human pose estimation. However, each key requires a large model to handle, so previous methods have had to choose only one of them. Thanks to the Transformer's powerful long-range relationship modeling and its latent adaptation of one view to another, we can adopt both with an acceptable number of model parameters. To the best of our knowledge, we are the first to propose a multi-view, multi-frame method for 2D-to-3D human pose estimation, called ST2PE. The proposed method consists of a Single-view Spatial and Temporal Transformer (S-ST2) and a Multiple Cross-view Transformer-based Transmission (M-CT2) module. Inspired by the Transformer-in-Transformer structure, the single-view spatial and temporal transformer has two nested transformer modules: the inner one extracts spatial information between joints, while the outer one extracts temporal information between frames. The cross-view transformer-based transmission module extends the original encoder-decoder setting of the Transformer to a multi-view setting. Experiments on three mainstream datasets (Human3.6M, HumanEva, MPI-INF-3DHP) demonstrate that ST2PE achieves state-of-the-art performance among 2D-to-3D methods.
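The abstract's nested design (an inner transformer over joints, an outer transformer over frames) can be illustrated with a minimal sketch. This is not the authors' code; the module name `NestedSpatialTemporal`, the default dimensions, and the choice to regress the 3D pose of the center frame are all illustrative assumptions.

```python
# Minimal sketch of a nested spatial-temporal transformer for 2D-to-3D lifting.
# Not the ST2PE implementation; shapes and hyperparameters are assumptions.
import torch
import torch.nn as nn

class NestedSpatialTemporal(nn.Module):
    def __init__(self, num_joints=17, dim=64, heads=4, layers=2):
        super().__init__()
        self.joint_embed = nn.Linear(2, dim)  # lift each 2D keypoint to a token
        # Inner (spatial) transformer: attends across joints within one frame.
        spatial_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.spatial = nn.TransformerEncoder(spatial_layer, num_layers=layers)
        # Outer (temporal) transformer: attends across frames, one token per frame.
        temporal_layer = nn.TransformerEncoderLayer(d_model=dim * num_joints, nhead=heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(temporal_layer, num_layers=layers)
        self.head = nn.Linear(dim * num_joints, num_joints * 3)  # regress 3D joints

    def forward(self, x):                              # x: (B, T, J, 2) 2D keypoints
        b, t, j, _ = x.shape
        tokens = self.joint_embed(x).reshape(b * t, j, -1)
        tokens = self.spatial(tokens)                  # spatial attention per frame
        frames = tokens.reshape(b, t, -1)              # concatenate joint tokens per frame
        frames = self.temporal(frames)                 # temporal attention across frames
        return self.head(frames[:, t // 2]).reshape(b, j, 3)  # 3D pose of center frame

# Usage (hypothetical): a 9-frame, 17-joint clip -> (1, 17, 3) 3D pose
pose3d = NestedSpatialTemporal()(torch.randn(1, 9, 17, 2))
```

The cross-view transmission module described in the abstract would sit on top of such a single-view backbone, exchanging features between views via encoder-decoder attention; it is omitted here since the abstract gives no further detail.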