Highlights

• We present a new Transformer-based method, the Multi-Hypothesis Transformer (MHFormer++), for 3D human pose estimation from monocular video. It builds a one-to-many-to-one framework that effectively learns spatiotemporal representations of multiple pose hypotheses in an end-to-end manner.
• A Multi-Hypothesis Generation (MHG) module is designed to capture both global and local information of human body joints within each frame and to generate multiple hypothesis representations carrying diverse semantic information in the spatial domain.
• A Self-Hypothesis Refinement (SHR) module and a Cross-Hypothesis Interaction (CHI) module are introduced to model temporal consistency across frames and to communicate among multi-hypothesis features both independently and mutually in the temporal domain.
• The proposed method achieves state-of-the-art performance on two challenging 3D human pose estimation benchmark datasets.
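The one-to-many-to-one data flow described above can be illustrated with a minimal toy sketch. This is not the authors' implementation: the function names mirror the module names (MHG, SHR, CHI), but the bodies are hypothetical stand-ins — random projections in place of spatial transformers, a causal moving average in place of temporal self-attention, and softmax-weighted merging in place of cross-hypothesis attention — chosen only to show how one input expands to multiple hypothesis streams and collapses back to one output.

```python
import numpy as np

rng = np.random.default_rng(0)

def mhg(frame_feats, num_hyp=3):
    """Multi-Hypothesis Generation (toy stand-in): project each frame's
    joint features through a different matrix per hypothesis, yielding
    diverse spatial representations (the one-to-many step)."""
    d = frame_feats.shape[-1]
    return [frame_feats @ rng.standard_normal((d, d)) for _ in range(num_hyp)]

def shr(hypotheses):
    """Self-Hypothesis Refinement (toy stand-in): refine each hypothesis
    independently along the temporal axis (here, a causal moving average
    standing in for per-hypothesis temporal attention)."""
    refined = []
    for h in hypotheses:                                  # h: (T, J, D)
        cum = np.cumsum(h, axis=0)
        counts = np.arange(1, h.shape[0] + 1)[:, None, None]
        refined.append(cum / counts)
    return refined

def chi(hypotheses):
    """Cross-Hypothesis Interaction (toy stand-in): let hypotheses exchange
    information via softmax-weighted mixing, then merge them into a single
    representation (the many-to-one step)."""
    stack = np.stack(hypotheses)              # (H, T, J, D)
    scores = stack.mean(axis=(1, 2, 3))       # one scalar score per hypothesis
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return np.tensordot(weights, stack, axes=1)  # (T, J, D)

# A 9-frame clip with 17 body joints and 8-dim features per joint
x = rng.standard_normal((9, 17, 8))
out = chi(shr(mhg(x)))
print(out.shape)  # single fused representation, same shape as the input
```

The point of the sketch is the shape discipline: `mhg` turns one `(T, J, D)` tensor into a list of hypothesis tensors, `shr` processes each stream on its own, and `chi` fuses them back into one tensor from which final 3D joint positions would be regressed.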