Keywords: 3D Reconstruction, SMPL-H, Computer Vision, Transformers
Abstract: In this paper, we introduce an approach to reconstruct 3D humans with expressive hands from a single input image. Current methods for pose estimation achieve robust performance for either bodies or hands, but fail to produce accurate 3D body and hand reconstructions simultaneously. To address this limitation, we take a more cohesive approach that ensures both the coarser and the finer features of the human body are properly localized. Our approach is based on a feedforward network and, following recent best practices, we adopt a fully transformer-based architecture. One of our key design choices is to leverage two separate backbone networks, one for 3D human pose and one for 3D hand pose estimation. These backbones independently process the body region and the hand regions and produce separate estimates for the body and the hands of the person. However, when these estimates are made independently, they tend to be inconsistent with one another and lead to unsatisfying reconstructions. Instead, we introduce a coupling transformer decoder that is trained to consolidate the intermediate features from the individual backbones into a consistent estimate for the body and the hands. The full system is trained on multiple datasets, including images with body ground truth, images with hand ground truth, and images with both body and hand ground truth. We evaluate our approach on the AGORA, ARCTIC, and COCO datasets, reporting metrics for both body and hand reconstruction accuracy to highlight our model's robustness over previous baselines.
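The coupling idea in the abstract, learned queries cross-attending over the intermediate features of both backbones so each output mixes body and hand evidence, can be illustrated with a minimal sketch. This is not the authors' implementation; the token counts, feature dimension, and single-head attention are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ values

rng = np.random.default_rng(0)
d = 64                                        # illustrative feature dimension
body_tokens = rng.standard_normal((16, d))    # features from the body backbone (hypothetical)
hand_tokens = rng.standard_normal((8, d))     # features from the hand backbone (hypothetical)
param_queries = rng.standard_normal((4, d))   # stand-in for learned decoder queries

# The decoder queries attend jointly over body and hand features, so each
# consolidated output token can draw on evidence from both streams,
# encouraging body and hand estimates that agree with each other.
context = np.concatenate([body_tokens, hand_tokens], axis=0)
fused = cross_attention(param_queries, context, context)
```

In an actual transformer decoder this attention would be multi-headed, stacked over several layers, and followed by regression heads that map the fused tokens to body and hand pose parameters.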
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 22206