Gesture-Aware Pretraining and Token Fusion for 3D Hand Pose Estimation
Keywords: 3D Hand Pose Estimation, Gesture-Aware Pretraining, Token Fusion, Transformer, MANO
Abstract: Estimating 3D hand pose from monocular RGB images is fundamental for applications
in AR/VR, human--computer interaction, and sign language understanding.
In this work we focus on a scenario where a discrete set of gesture labels is
available and show that gesture semantics can serve as a powerful inductive bias
for 3D pose estimation.
We present a two-stage framework: gesture-aware pretraining that learns an
informative embedding space from coarse and fine gesture labels in
InterHand2.6M~\cite{moon2020interhand2}, followed by a per-joint token
Transformer that uses the learned gesture embeddings as intermediate guidance
for the final regression of MANO hand parameters~\cite{romero2017mano}.
Training is driven by a layered objective over parameters, joints, and structural
constraints.
Experiments on InterHand2.6M demonstrate that gesture-aware pretraining
consistently improves single-hand accuracy over the
state-of-the-art EANet~\cite{park2023extract} baseline, and that the benefit
transfers across architectures without any modification.
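To make the token-fusion idea concrete, here is a minimal numpy sketch (not the paper's actual architecture): a pretrained gesture embedding is prepended as an extra token to per-joint tokens, a single self-attention pass fuses them, and a linear head regresses a 3-vector per joint. All dimensions, weights, and the single-head attention are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64   # assumed token dimension
J = 21   # hand keypoints (illustrative)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, Wq, Wk, Wv):
    # single-head scaled dot-product attention over all tokens
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    A = softmax(Q @ K.T / np.sqrt(D))
    return A @ V

# hypothetical inputs: per-joint image tokens and a gesture embedding
# produced by the (assumed) gesture-aware pretrained encoder
joint_tokens = rng.standard_normal((J, D))
gesture_emb = rng.standard_normal((D,))

# token fusion: gesture embedding joins the sequence as an extra token,
# so every joint token can attend to the gesture semantics
tokens = np.vstack([gesture_emb[None, :], joint_tokens])  # (J+1, D)

Wq, Wk, Wv = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))
fused = self_attention(tokens, Wq, Wk, Wv)               # (J+1, D)

# illustrative regression head: one 3-vector per joint token
# (a real system would map to the MANO parameter vector instead)
W_head = rng.standard_normal((D, 3)) / np.sqrt(D)
per_joint_out = fused[1:] @ W_head                       # (J, 3)
```

In practice the fused joint tokens would feed the MANO parameter regressor described in the abstract; the sketch only shows how a gesture token can condition per-joint features through attention.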
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 10