Abstract: Existing depth-based 3D hand pose estimation methods typically estimate hand joints from either 2D depth images or 3D point clouds, while approaches that fuse the two modalities remain underexplored. Furthermore, previous methods often struggle to learn geometry-facilitated features and precise joint correlations, especially for occluded hands, because they lack explicit prior guidance and sufficient cross-dimensional interaction. Building on multi-modal fusion, cross-dimensional interaction, and prior guidance, we propose a novel joint-guided keypoint denoising Transformer, named HandJoKe, for more precise hand pose estimation. HandJoKe iteratively estimates hand poses from keypoint features extracted from both 2D depth images and 3D point clouds, under explicit joint guidance, in only a few denoising steps. Rather than directly applying existing multi-modal fusion, which performs redundant interactions among many background pixels and irrelevant points, HandJoKe models correlations and dependencies among local informative hand regions (i.e., keypoints), attaining higher learning capability with less redundant computation. Moreover, a novel joint-guided denoising estimation strategy fuses cross-modal keypoint features under explicit joint guidance, achieving geometry-facilitated cross-modal keypoint interaction in both 2D and 3D spaces. Iterative denoising further strengthens the joint guidance: each step updates the cross-modal keypoint features based on the previously denoised hand pose, which helps locate ambiguous joints, especially for occluded hands. Extensive experiments show that HandJoKe achieves state-of-the-art performance on four challenging public benchmarks: the single-hand datasets NYU and ICVL, and the hand-object datasets DexYCB and HO3D.
External IDs: dblp:journals/tcsv/GanCHLLG26