ExtPose: Robust and Coherent Pose Estimation by Extending ViTs

Rongyu Chen; Li'an Zhuo; Linlin Yang; Qi WANG; Liefeng Bo; Bang Zhang; Angela Yao

Published: 01 May 2025, Last Modified: 24 Jul 2025, ICML 2025 poster, CC BY 4.0

Abstract: Vision Transformers (ViTs) perform remarkably well at 3D pose estimation, yet they still face two challenges. First, the popular ViT architecture for pose estimation is limited to images and lacks temporal information. Second, its predictions often fail to maintain pixel alignment with the original images. To address these issues, we propose ExtPose, a systematic framework for 3D pose estimation. ExtPose extends the image ViT to challenging scenarios and to the video setting by taking in additional 2D pose evidence and capturing temporal information in a fully attention-based manner. We use 2D human skeleton images to integrate structured 2D pose information. By sharing parameters and attending across modalities and frames, we enhance the consistency between 3D poses and 2D videos without introducing additional parameters. We achieve state-of-the-art (SOTA) performance on multiple human and hand pose estimation benchmarks, reducing PA-MPJPE to 34.0 mm (-23%) on 3DPW and 4.9 mm (-18%) on FreiHAND relative to other ViT-based methods.
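For readers unfamiliar with the metric quoted in the abstract, PA-MPJPE is commonly understood as the mean per-joint position error measured after a Procrustes (similarity) alignment of the prediction to the ground truth. The sketch below illustrates that common definition for a single pose using NumPy. It is not taken from the ExtPose code; the function names `mpjpe` and `pa_mpjpe` and the `(J, 3)` array layout are assumptions made for illustration.

```python
import numpy as np


def mpjpe(pred, gt):
    """Mean Euclidean distance per joint, with no alignment.

    pred, gt: arrays of shape (J, 3) in the same units (e.g. mm).
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    return np.linalg.norm(pred - gt, axis=1).mean()


def pa_mpjpe(pred, gt):
    """MPJPE after the best similarity transform (scale, rotation,
    translation) of pred onto gt, in the least-squares sense.

    Assumes pred is not degenerate (its joints are not all identical).
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)

    # Remove translation by centering both joint sets.
    mu_g = gt.mean(axis=0)
    p = pred - pred.mean(axis=0)
    g = gt - mu_g

    # Optimal rotation from the SVD of the 3x3 cross-covariance.
    U, S, Vt = np.linalg.svd(p.T @ g)
    # Flip the last axis if needed so the result is a proper rotation.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    D = np.diag([1.0, 1.0, d])
    R = Vt.T @ D @ U.T

    # Optimal isotropic scale for the rotated prediction.
    scale = (S * np.diag(D)).sum() / (p ** 2).sum()

    aligned = scale * p @ R.T + mu_g
    return np.linalg.norm(aligned - gt, axis=1).mean()


if __name__ == "__main__":
    gt = np.array([[0, 0, 0], [1, 0, 0], [0, 2, 0], [0, 0, 3], [1, 1, 1]], float)
    # A prediction that differs from gt only by scale, rotation, and shift.
    Rz = np.array([[0, -1, 0], [1, 0, 0], [0, 0, 1]], float)
    pred = 2.0 * gt @ Rz.T + np.array([5.0, -3.0, 1.0])
    print("MPJPE:", mpjpe(pred, gt))
    print("PA-MPJPE:", pa_mpjpe(pred, gt))
```

Because the alignment removes global scale, rotation, and translation, PA-MPJPE reflects errors in the articulated pose itself rather than in global placement, which is why a prediction that is merely rotated and rescaled scores near zero in the usage example.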

Lay Summary: Vision Transformers (ViT) excel in 3D pose estimation but face challenges with temporal information and pixel alignment. To address these, we developed ExtPose, a new framework that integrates 2D human skeleton images and frames into ViTs, enhancing temporal and spatial accuracy. ExtPose employs an attention-based mechanism to effectively capture and process temporal sequences, improving the consistency between 3D poses and 2D images without additional parameters. This integration results in state-of-the-art performance on key benchmarks, setting new standards for accuracy in pose estimation tasks.

Link To Code: https://gloryyrolg.github.io/extpose

Primary Area: Applications->Computer Vision

Keywords: 3D Pose Estimation, Transformer

Submission Number: 2283
