Keywords: Whole-body, SMPL-X Model, Human Pose and Shape Estimation, Human Mesh Recovery
Abstract: Whole-body pose and shape estimation aims to jointly predict different behaviors (e.g., pose, hand gesture, facial expression) of the entire human body from a monocular image. Existing methods often exhibit suboptimal performance due to the complexity of in-the-wild scenarios. We argue that the prediction accuracy of these models is significantly affected by the quality of the _bounding box_, e.g., its scale and alignment. The natural discrepancy between ideal bounding box annotations and model detection results is particularly detrimental to the performance of whole-body pose and shape estimation.
In this paper, we propose a novel framework to enhance the robustness of whole-body pose and shape estimation. Our framework incorporates three new modules that address the above challenges from three perspectives: (1) a **Localization Module** enhances the model's awareness of the subject's location and semantics within the image space; (2) a **Contrastive Feature Extraction Module** encourages the model to be invariant to robust augmentations by incorporating a contrastive loss and positive samples; (3) a **Pixel Alignment Module** ensures that the mesh reprojected from the predicted camera and body model parameters is more accurate and pixel-aligned. We perform comprehensive experiments to demonstrate the effectiveness of our proposed framework on body, hands, face, and whole-body benchmarks.
Supplementary Material: pdf
Submission Number: 11757