Multi-hypothesis representation learning for transformer-based 3D human pose estimation

Published: 01 Jan 2023, Last Modified: 13 Nov 2024 · Pattern Recognit. 2023 · CC BY-SA 4.0
Abstract:

Highlights:
• We present a new Transformer-based method, called Multi-Hypothesis Transformer (MHFormer++), for 3D human pose estimation from monocular videos. It builds a one-to-many-to-one framework that effectively learns spatio-temporal representations of multiple pose hypotheses in an end-to-end manner.
• A Multi-Hypothesis Generation (MHG) module is designed to capture both global and local information of human body joints within each frame and to generate multiple hypothesis representations containing diverse semantic information in the spatial domain.
• A Self-Hypothesis Refinement (SHR) module and a Cross-Hypothesis Interaction (CHI) module are introduced to model temporal consistencies across frames and to enable communication among multi-hypothesis features, both independently and mutually, in the temporal domain.
• The proposed method achieves state-of-the-art performance on two challenging 3D human pose estimation benchmark datasets.
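The authors' implementation is not shown on this page; the following is a minimal PyTorch-style sketch of how a one-to-many-to-one multi-hypothesis pipeline of this kind could be organized. The class structure, hypothesis count, layer choices, and dimensions are illustrative assumptions, not the paper's architecture.

```python
# Conceptual sketch (assumption, not the authors' code) of a one-to-many-to-one
# multi-hypothesis pipeline for lifting 2D keypoint sequences to 3D poses.
import torch
import torch.nn as nn


class MultiHypothesisSketch(nn.Module):
    def __init__(self, num_joints=17, num_hyp=3, dim=256):
        super().__init__()
        self.num_hyp = num_hyp
        # One-to-many: embed 2D joints and branch into several hypothesis streams
        # (stands in for the Multi-Hypothesis Generation / MHG stage).
        self.embed = nn.Linear(num_joints * 2, dim)
        self.mhg = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
             for _ in range(num_hyp)]
        )
        # Refine each hypothesis independently over time
        # (stands in for Self-Hypothesis Refinement / SHR).
        self.shr = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
             for _ in range(num_hyp)]
        )
        # Many-to-one: let hypotheses exchange information, then fuse them
        # (stands in for Cross-Hypothesis Interaction / CHI).
        self.chi = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.head = nn.Linear(dim, num_joints * 3)

    def forward(self, pose_2d):
        # pose_2d: (batch, frames, joints, 2) detected 2D keypoints per frame
        b, t, j, _ = pose_2d.shape
        x = self.embed(pose_2d.reshape(b, t, j * 2))            # (b, t, dim)
        hyps = [layer(x) for layer in self.mhg]                 # one-to-many
        hyps = [layer(h) for layer, h in zip(self.shr, hyps)]   # per-stream refinement
        # Concatenate hypotheses along the sequence axis so attention can mix
        # them, then average back to a single stream (many-to-one fusion).
        mixed = self.chi(torch.cat(hyps, dim=1))                # (b, num_hyp*t, dim)
        fused = mixed.reshape(b, self.num_hyp, t, -1).mean(dim=1)
        return self.head(fused).reshape(b, t, j, 3)             # 3D pose per frame


if __name__ == "__main__":
    model = MultiHypothesisSketch()
    out = model(torch.randn(2, 27, 17, 2))
    print(out.shape)  # torch.Size([2, 27, 17, 3])
```

The fusion step here is a simple mean over hypothesis streams; the point of the sketch is only the overall flow — generate several spatial hypotheses, refine each temporally, then let them interact before regressing a single 3D pose sequence.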
