Abstract: Reconstructing a 3D hand from a single RGB image is a challenging task. Most existing Transformer-based 3D hand reconstruction methods do not fully exploit the local spatial information in low-level image features, which is crucial for capturing the fine details and accurate shape of the hand. As a result, the reconstructed hands often lack the precision and realism required by applications such as augmented reality and hand gesture recognition. To address this limitation, we propose HybridMETRO, a novel and efficient method that uses both low-level and high-level image features to accurately reconstruct 3D hand pose and mesh vertices from a single RGB image. Specifically, we introduce deformable attention into the Transformer encoder, so that the encoder is no longer limited by the length of the image feature sequence. Building on this mechanism, we further propose an interleaved-updating multi-scale feature encoder that fuses low-level and high-level features. Moreover, we incorporate the Graph Convolutional Residual (GCR) module into a novel decoder that captures explicit semantic connections between grid vertices and thus improves the spatial locality of the extracted features. Experimental results demonstrate that, compared with state-of-the-art methods, HybridMETRO achieves better performance with significantly fewer model parameters: about half as many as METRO and a quarter as many as HandOccNet.
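To make the idea of a graph-convolutional residual update over vertices more concrete, the following is a minimal sketch in plain Python. It is not the paper's GCR module, whose exact design the abstract does not specify. It assumes a generic formulation: neighbor features are averaged over a row-normalized adjacency matrix with self-loops, linearly transformed, passed through a ReLU, and added back to the input. All function names, the toy three-vertex graph, and the identity weight matrix are hypothetical.

```python
# Illustrative sketch only: a generic graph convolution with a residual
# connection. This is an assumed formulation, not the GCR module from the paper.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalized_adjacency(adjacency):
    """Add self-loops, then scale each row so that it sums to 1."""
    n = len(adjacency)
    with_loops = [
        [adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    return [[value / sum(row) for value in row] for row in with_loops]


def graph_conv_residual(features, adjacency, weight):
    """Return features + ReLU(A_hat @ features @ weight).

    `weight` must be square so the residual addition is shape-compatible.
    """
    a_hat = normalized_adjacency(adjacency)
    aggregated = matmul(a_hat, features)       # average over each vertex's neighborhood
    transformed = matmul(aggregated, weight)   # shared linear map per vertex
    activated = [[max(0.0, v) for v in row] for row in transformed]
    return [
        [x + h for x, h in zip(x_row, h_row)]  # residual (skip) connection
        for x_row, h_row in zip(features, activated)
    ]


# Hypothetical toy input: three vertices connected in a chain (0 - 1 - 2),
# each carrying a 2-dimensional feature vector.
adjacency = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
]
features = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]
weight = [
    [1.0, 0.0],
    [0.0, 1.0],
]

updated = graph_conv_residual(features, adjacency, weight)
```

In this sketch each vertex's updated feature mixes in information only from its direct neighbors, which is one way to read the abstract's point about improving spatial locality. The residual term keeps the original per-vertex feature intact.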
External IDs:doi:10.1145/3734873