Handformer2T: A Lightweight Regression-based Model for Interacting Hands Pose Estimation from A Single RGB Image

Published: 01 Jan 2024, Last Modified: 13 May 2025 · WACV 2024 · CC BY-SA 4.0
Abstract: Despite its extensive range of potential applications in virtual and augmented reality, 3D interacting hand pose estimation from a single RGB image remains a very challenging problem, due to appearance confusion between the keypoints of the two hands and severe hand-hand occlusion. Owing to their ability to capture long-range relationships between keypoints, transformer-based methods have gained popularity in the research community. However, existing methods usually deploy tokens at the keypoint level, which inevitably results in high computational and memory complexity. In this paper, we propose a simple yet novel mechanism, hand-level tokenization, in our transformer-based model, where we deploy only one token per hand. Building on this design, we also propose a pose query enhancer module, which refines the pose prediction iteratively by focusing on features guided by previous coarse pose predictions. As a result, our proposed model, Handformer2T, achieves high performance while remaining lightweight. Extensive experiments on public benchmarks demonstrate that our model achieves state-of-the-art performance on interacting-hand pose estimation with higher throughput, less memory, and faster speed.
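The complexity argument behind hand-level tokenization can be illustrated with a minimal sketch. This is not the authors' implementation; the feature dimension, the 21-keypoint hand layout, and the use of mean pooling to form a hand token are illustrative assumptions. The point it shows is that self-attention cost grows quadratically with token count, so replacing per-keypoint tokens (2 hands × 21 keypoints = 42 tokens) with one token per hand shrinks the attention matrix dramatically.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64     # feature dimension (assumed)
KPTS = 21  # keypoints per hand (standard hand-skeleton convention)

# Per-keypoint features for each hand, assumed to come from an image backbone.
left_feats = rng.standard_normal((KPTS, D))
right_feats = rng.standard_normal((KPTS, D))

# Keypoint-level tokenization: one token per keypoint -> 42 tokens.
keypoint_tokens = np.concatenate([left_feats, right_feats], axis=0)

# Hand-level tokenization: pool each hand into a single token -> 2 tokens.
# (Mean pooling is an assumption; any learned aggregation would do.)
hand_tokens = np.stack([left_feats.mean(axis=0), right_feats.mean(axis=0)])

# Self-attention computes an n x n score matrix over n tokens,
# so the attention cost drops from 42*42 to 2*2 entries per layer.
keypoint_attn_entries = keypoint_tokens.shape[0] ** 2  # 1764
hand_attn_entries = hand_tokens.shape[0] ** 2          # 4
print(keypoint_attn_entries, hand_attn_entries)
```

Under these assumptions, a hand-level transformer layer touches roughly 440× fewer attention entries than a keypoint-level one, which is consistent with the paper's claim of higher throughput and lower memory use.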