HMR-Adapter: A Lightweight Adapter with Dual-Path Cross Augmentation for Expressive Human Mesh Recovery

Wenhao Shen; Wanqi Yin; Hao Wang; Chen Wei; Zhongang Cai; Lei Yang; Guosheng Lin

Published: 20 Jul 2024, Last Modified: 21 Jul 2024, MM2024 Poster, CC BY 4.0

Abstract: Expressive Human Mesh Recovery (HMR) involves reconstructing the 3D human body, including hands and face, from RGB images. The task is difficult because humans are highly deformable, and hands are small and frequently occluded. Recent approaches have attempted to mitigate these issues using large datasets and models, but these solutions remain imperfect. Specifically, whole-body estimation models often estimate hand poses inaccurately, while hand expert models struggle with severe occlusions. To overcome these limitations, we introduce a dual-path cross augmentation framework with a novel adaptation approach called HMR-Adapter that enhances the decoding module of large HMR models. HMR-Adapter significantly improves expressive HMR performance by injecting additional guidance from other body parts. This approach refines hand pose predictions by incorporating body pose information and uses additional hand features to enhance body pose estimation in whole-body models. Remarkably, an HMR-Adapter with only about 27M parameters achieves better performance when fine-tuning the large model on a target dataset. Furthermore, HMR-Adapter significantly improves expressive HMR results by combining the adapted large whole-body and hand expert models. Extensive experiments and analysis demonstrate the efficacy of our method.

Primary Subject Area: [Experience] Interactions and Quality of Experience

Relevance To Conference: Our work on 3D Expressive Human Mesh Recovery (HMR) from RGB images contributes to multimedia and multimodal processing by enhancing user interaction and realism in virtual environments. First, we integrate multimodal data by combining visual data (RGB images) with spatial and kinematic data (body and hand poses) to improve the accuracy of 3D pose estimation. Second, accurately reconstructing 3D human bodies, including intricate details like hand interactions, leads to more realistic and responsive interactions in multimedia applications, improving the immersive experience in applications like augmented and mixed reality. Third, our method's ability to achieve high accuracy with fewer parameters is a substantial advance in efficient multimedia content processing, paving the way for new applications in gaming, virtual reality, and interactive media. Finally, our research provides a foundation for further innovations in multimedia content creation, particularly in areas requiring detailed and accurate human models, such as animation and simulation. This can enrich the multimedia content landscape and offer new creative possibilities for content creators.

Supplementary Material: zip

Submission Number: 5247
