3D Hand Pose Estimation via Articulated Anchor-to-Joint 3D Local Regressors

Published: 2026, Last Modified: 07 Feb 2026IEEE Trans. Pattern Anal. Mach. Intell. 2026EveryoneRevisionsBibTeXCC BY-SA 4.0
Abstract: In this paper, we propose to address monocular 3D hand pose estimation from a single RGB or depth image via articulated anchor-to-joint 3D local regressors, in form of A2J-Transformer+. The key idea is to make the local regressors (i.e., anchor points) in 3D space be aware of hand’s local fine details and global articulated context jointly, to facilitate predicting their 3D offsets toward hand joints with linear weighted aggregation for joint localization. Our intuition is that, local fine details help to estimate accurate offset but may suffer from the issues including serious occlusion, confusing similar patterns, and overfitting risk. On the other hand, hand’s global articulated context can essentially provide additional descriptive clues and constraints to alleviate these issues. To set anchor points adaptively in 3D space, A2J-Transformer+ runs in a 2-stage manner. At the first stage, since the input modality property anchor points distribute more densely on X-Y plane, it leads to lower prediction accuracy along Z direction compared with those in the X and Y directions. To alleviate this, at the second stage anchor points are set near the joints yielded by the first stage evenly along X, Y, and Z directions. This treatment brings two main advantages: (1) balancing the prediction accuracy along X, Y, and Z directions, and (2) ensuring the anchor-joint offsets are of small values relatively easy to estimate. Wide-range experiments on three RGB hand datasets (InterHand2.6 M, HO-3D V2 and RHP) and three depth hand datasets (NYU, ICVL and HANDS 2017) verify A2J-Transformer+’s superiority and generalization ability for different modalities (i.e., RGB and depth) and hand cases (i.e., single hand, interacting hands, and hand-object interaction), even outperforming model-based manners. The test on ITOP dataset reveals that, A2J-Transformer+ can also be applied to 3D human pose estimation task.
Loading