Gesture-Aware Pretraining and Token Fusion for 3D Hand Pose Estimation

Published: 07 May 2026 (Last Modified: 07 May 2026) · PhysHuman Workshop @ CVPR 2026 · Poster · License: CC BY 4.0
Keywords: 3D Hand Pose Estimation, Gesture-Aware Pretraining, Token Fusion, Transformer, MANO
Abstract: Estimating 3D hand pose from monocular RGB images is fundamental for applications in AR/VR, human--computer interaction, and sign language understanding. In this work we focus on a scenario where a discrete set of gesture labels is available, and we show that gesture semantics can serve as a powerful inductive bias for 3D pose estimation. We present a two-stage framework: first, gesture-aware pretraining learns an informative embedding space from the coarse and fine gesture labels of InterHand2.6M~\cite{moon2020interhand2}; second, a per-joint token Transformer uses the resulting gesture embeddings as intermediate representations to regress the final MANO hand parameters~\cite{romero2017mano}. Training is driven by a layered objective over parameters, joints, and structural constraints. Experiments on InterHand2.6M demonstrate that gesture-aware pretraining consistently improves single-hand accuracy over the state-of-the-art EANet~\cite{park2023extract} baseline, and that the benefit transfers across architectures without any modification.
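The abstract's core mechanism, fusing a gesture embedding into per-joint tokens before regressing MANO parameters, can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the token width, the broadcast-add fusion, and the single linear head are all assumptions made for clarity (the paper's actual model is a Transformer with a layered loss).

```python
import numpy as np

rng = np.random.default_rng(0)

N_JOINTS, D = 21, 64   # 21 hand joints; token width D is illustrative
N_MANO = 48 + 10       # MANO pose (48) + shape (10) parameters

def fuse_gesture_tokens(joint_tokens, gesture_emb):
    """Token-fusion sketch: broadcast-add the gesture embedding onto every
    per-joint token so gesture semantics condition each joint's features."""
    return joint_tokens + gesture_emb[None, :]

# Frozen random weights stand in for the trained Transformer + regression head.
W_head = rng.normal(0.0, 0.01, size=(N_JOINTS * D, N_MANO))

def regress_mano(joint_tokens, gesture_emb):
    # Fuse, flatten all joint tokens, and map to the MANO parameter vector.
    fused = fuse_gesture_tokens(joint_tokens, gesture_emb)
    return fused.reshape(-1) @ W_head

joint_tokens = rng.normal(size=(N_JOINTS, D))   # e.g. from an image backbone
gesture_emb = rng.normal(size=(D,))             # from the pretrained gesture encoder
theta_beta = regress_mano(joint_tokens, gesture_emb)
print(theta_beta.shape)  # (58,)
```

The broadcast-add is only one plausible fusion choice; concatenation or cross-attention over the gesture token would fit the same interface.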
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 10