APP: Adaptive Pose Pooling for 3D Human Pose Estimation from Videos

Jinyan Zhang; Mengyuan Liu; Hong Liu; Guoquan Wang; Wenhao Li

Published: 20 Jul 2024, Last Modified: 21 Jul 2024 | MM2024 Poster | CC BY 4.0

Abstract: Recent advances in 3D human pose estimation have achieved notable success by lifting 2D poses to their 3D counterparts. However, this approach inherits the errors introduced by 2D pose detectors and overlooks the spatial information embedded in RGB images. To address these challenges, we introduce a versatile module called Adaptive Pose Pooling (APP), which is compatible with many existing 2D-to-3D lifting models. The APP module includes three novel sub-modules: Pose-Aware Offsets Generation (PAOG), Pose-Aware Sampling (PAS), and Spatial Temporal Information Fusion (STIF). First, we extract latent features from the multi-frame lifting model. Next, a 2D pose detector extracts multi-level feature maps from the image. PAOG then generates offsets from these feature maps, PAS uses the offsets to sample the feature maps, and STIF fuses the sampled features with the latent features. This design allows the APP module to capture spatial and temporal information simultaneously. We conduct comprehensive experiments on two widely used datasets, Human3.6M and MPI-INF-3DHP, and employ various lifting models to demonstrate the efficacy of the APP module. Our results show that the proposed APP module consistently enhances the performance of lifting models, achieving state-of-the-art results. Notably, our module achieves these gains without requiring any changes to the architecture of the lifting model.
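The data flow described in the abstract (offsets, then sampling, then fusion) can be illustrated with a minimal NumPy sketch for a single frame and a single feature level. This is not the authors' implementation: the function names `bilinear_sample` and `app_sketch`, the weight matrices `w_off` and `w_fuse`, the tanh-bounded offsets, and the concatenate-then-project fusion are all hypothetical stand-ins chosen only to make the three steps concrete.

```python
import numpy as np

def bilinear_sample(fmap, xy):
    """Bilinearly sample fmap (C, H, W) at pixel coords xy (J, 2) -> (J, C)."""
    C, H, W = fmap.shape
    x = np.clip(xy[:, 0], 0, W - 1)
    y = np.clip(xy[:, 1], 0, H - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, W - 1)
    y1 = np.minimum(y0 + 1, H - 1)
    wx = x - x0
    wy = y - y0
    top = fmap[:, y0, x0] * (1 - wx) + fmap[:, y0, x1] * wx  # (C, J)
    bot = fmap[:, y1, x0] * (1 - wx) + fmap[:, y1, x1] * wx  # (C, J)
    return (top * (1 - wy) + bot * wy).T                     # (J, C)

def app_sketch(fmap, joints_2d, latent, w_off, w_fuse):
    """Hypothetical single-frame stand-in for the PAOG -> PAS -> STIF flow."""
    # PAOG stand-in (assumption): predict a 2D offset per joint from the
    # features at the detected joint location, bounded to +/-1 pixel by tanh.
    anchor_feat = bilinear_sample(fmap, joints_2d)            # (J, C)
    offsets = np.tanh(anchor_feat @ w_off)                    # (J, 2)
    # PAS stand-in (assumption): sample the feature map at the shifted points.
    sampled = bilinear_sample(fmap, joints_2d + offsets)      # (J, C)
    # STIF stand-in (assumption): concatenate the sampled image features with
    # the lifting model's latent features and apply a linear projection.
    fused = np.concatenate([sampled, latent], axis=1) @ w_fuse  # (J, D)
    return offsets, sampled, fused

# Toy usage with random inputs; all sizes are arbitrary.
rng = np.random.default_rng(0)
C, H, W, J, D = 4, 8, 8, 17, 6
fmap = rng.normal(size=(C, H, W))          # one feature level from the detector
joints_2d = rng.uniform(0, 7, size=(J, 2)) # detected 2D joints as (x, y)
latent = rng.normal(size=(J, D))           # latent features of the lifting model
w_off = rng.normal(size=(C, 2))
w_fuse = rng.normal(size=(C + D, D))
offsets, sampled, fused = app_sketch(fmap, joints_2d, latent, w_off, w_fuse)
```

In this sketch the temporal information enters only through `latent`, which is assumed to come from the multi-frame lifting model; the lifting model itself is left untouched, which matches the abstract's claim that no architectural changes are needed.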

Primary Subject Area: [Content] Media Interpretation

Secondary Subject Area: [Content] Vision and Language, [Experience] Multimedia Applications, [Content] Multimodal Fusion

Relevance To Conference: 3D human pose estimation is a transformative technology that enhances multimedia processing by providing a comprehensive understanding of human movement and interaction within content. It plays a pivotal role in various domains, notably enriching content analysis by enabling the recognition of body postures and activities in videos. This capability finds applications in diverse fields, such as sports analytics, facilitating deeper insights and more nuanced interpretations of multimedia content. In this paper, we present Adaptive Pose Pooling (APP), a novel module that improves the accuracy of 3D human pose estimation. Moreover, accurate 3D pose estimation significantly improves user experience, particularly in virtual reality (VR) and augmented reality (AR) environments. Refining user interaction with virtual elements creates a more seamless and immersive experience, thereby enhancing the overall usability and engagement of multimedia platforms. Furthermore, the technology streamlines content creation workflows by automating pose correction in animation and motion capture processes. This automation saves time and resources and ensures higher accuracy and consistency in the final output. In conclusion, 3D human pose estimation unlocks a deeper understanding of human movement in multimedia, leading to richer analysis, more intuitive interactions, and broader accessibility. This ultimately enhances the overall quality and inclusivity of multimedia experiences.

Supplementary Material: zip

Submission Number: 721
