Abstract: Estimating 3D human poses from monocular images is an important research area with many practical applications. However, the depth ambiguity of 2D solutions limits their accuracy in actions where occlusion exists or where slight centroid shifts can result in significant 3D pose variations. In this paper, we introduce a novel multimodal approach to mitigate the depth ambiguity inherent in monocular solutions by integrating spatial-aware pressure information. To achieve this, we first establish a data collection system with a pressure mat and a monocular camera, and construct a large-scale multimodal human activity dataset comprising over 600,000 frames of motion data. Using this dataset, we propose a pressure image reconstruction network to extract pressure priors from monocular images. We then introduce a Transformer-based multimodal pose estimation network that combines pressure priors with monocular images, achieving a world mean per joint position error (W-MPJPE) of 51.6mm and outperforming state-of-the-art methods. Extensive experiments demonstrate the effectiveness of our multimodal 3D human pose estimation method across various actions and joints, highlighting the significance of spatial-aware pressure in improving the accuracy of monocular 3D pose estimation methods. Our dataset is available at: https://anonymous.4open.science/r/SATPose-51DD.
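For reference, the W-MPJPE metric quoted above is the mean Euclidean distance between predicted and ground-truth joint positions in the absolute world frame (no root alignment). A minimal sketch, assuming joint positions are given in millimetres as `(frames, joints, 3)` arrays (the function name and shapes here are illustrative, not taken from the paper's code):

```python
import numpy as np

def w_mpjpe(pred, gt):
    """World mean per joint position error (mm).

    pred, gt: arrays of shape (frames, joints, 3) holding world-frame
    joint positions in millimetres. Unlike Procrustes-aligned variants,
    no root translation or rigid alignment is applied before averaging.
    """
    # Per-joint Euclidean distance, then mean over all frames and joints.
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))

# Toy check: a uniform 10 mm offset along one axis gives a 10 mm error.
pred = np.zeros((2, 17, 3))
gt = pred.copy()
gt[..., 2] += 10.0
print(w_mpjpe(pred, gt))  # 10.0
```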
Primary Subject Area: [Experience] Interactions and Quality of Experience
Secondary Subject Area: [Content] Multimodal Fusion
Relevance To Conference: In this paper, we propose a novel multimodal approach for 3D human pose estimation that mitigates the depth ambiguity inherent in monocular-vision-only solutions by incorporating spatial-aware ground pressure information. Accurate 3D human pose estimation is the foundation for downstream VR and AR applications, such as sports and fitness, medical rehabilitation, and virtual character embodiment in games, and it is crucial for ensuring high-quality interactive experiences and interaction fidelity for users. 3D human pose estimation solutions based on monocular vision often struggle with actions involving occlusion or depth ambiguity due to the lack of depth information. Our approach mitigates this limitation by incorporating ground pressure information, thereby enhancing the overall 3D perception capability of the multimodal system. Furthermore, the pressure sensor, which takes the form of a pressure mat, can be seamlessly integrated into devices such as fitness mats and gaming mats, improving pose estimation accuracy without compromising the user experience. By integrating visual and tactile modalities, our work advances multimedia interactive technologies and multimodal processing techniques, addressing challenges in multimedia applications and expanding the capabilities of multimedia systems.
Supplementary Material: zip
Submission Number: 5353