ARTS: Semi-Analytical Regressor using Disentangled Skeletal Representations for Human Mesh Recovery from Videos

Published: 20 Jul 2024, Last Modified: 02 Aug 2024 | MM2024 Poster | CC BY 4.0
Abstract: Although existing video-based 3D human mesh recovery methods have made significant progress, simultaneously estimating human pose and shape from low-resolution image features limits their performance. These image features lack sufficient spatial information about the human body and contain various sources of noise (e.g., background, lighting, and clothing), which often results in inaccurate pose and inconsistent motion. Inspired by the rapid advance in human pose estimation, we observe that, compared to image features, skeletons inherently contain accurate human pose and motion. Therefore, we propose a novel semi-Analytical Regressor using disenTangled Skeletal representations for human mesh recovery from videos, called ARTS, which effectively leverages the disentangled information in skeletons. Specifically, a skeleton estimation and disentanglement module estimates the 3D skeletons from a video and decouples them into disentangled skeletal representations (i.e., joint position, bone length, and human motion). Then, to fully utilize these representations, we introduce a semi-analytical regressor to estimate the parameters of the human mesh model. The regressor consists of three modules: Temporal Inverse Kinematics (TIK), Bone-guided Shape Fitting (BSF), and Motion-Centric Refinement (MCR). TIK uses joint position to estimate initial pose parameters, and BSF uses bone length to regress bone-aligned shape parameters. Finally, MCR combines the human motion representation with image features to refine the initial parameters of the human model and enhance temporal consistency. Extensive experiments demonstrate that ARTS surpasses existing state-of-the-art video-based methods in both per-frame accuracy and temporal consistency on popular benchmarks: 3DPW, MPI-INF-3DHP, and Human3.6M. Code is available at https://github.com/TangTao-PKU/ARTS.
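To make the three disentangled skeletal representations named in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the 5-joint kinematic tree `PARENTS`, the function name `disentangle`, averaging bone lengths over frames, and using frame-to-frame displacement as the motion representation are all assumptions made for illustration only.

```python
import math

# Hypothetical 5-joint kinematic tree (child -> parent). The skeleton layout
# actually used by ARTS is not given in the abstract.
PARENTS = {1: 0, 2: 1, 3: 1, 4: 0}


def disentangle(skeletons):
    """Split a sequence of 3D skeletons into position, bone length, and motion.

    skeletons: list of frames; each frame is a list of (x, y, z) joint tuples.
    """
    # Joint position: the per-frame 3D coordinates, unchanged.
    positions = [list(frame) for frame in skeletons]

    # Bone length: child-to-parent distance, averaged over frames on the
    # assumption that a person's bone lengths stay fixed within a video.
    n = len(skeletons)
    bone_lengths = {
        (parent, child): sum(math.dist(f[child], f[parent]) for f in skeletons) / n
        for child, parent in PARENTS.items()
    }

    # Motion: frame-to-frame displacement of every joint (finite difference).
    motion = [
        [tuple(c - p for c, p in zip(cur, prev)) for cur, prev in zip(f1, f0)]
        for f0, f1 in zip(skeletons, skeletons[1:])
    ]
    return positions, bone_lengths, motion


# Usage: a two-frame toy sequence in which the whole body shifts along x.
frame0 = [(0, 0, 0), (0, 1, 0), (1, 1, 0), (-1, 1, 0), (0, -1, 0)]
frame1 = [(x + 0.5, y, z) for x, y, z in frame0]
positions, bone_lengths, motion = disentangle([frame0, frame1])
```

In this toy sequence every bone has unit length and every joint moves by the same displacement, so the bone-length and motion outputs separate cleanly from the raw positions. A real pipeline would feed such representations to learned modules such as TIK, BSF, and MCR rather than use them directly.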
Primary Subject Area: [Content] Media Interpretation
Secondary Subject Area: [Content] Vision and Language, [Content] Multimodal Fusion, [Experience] Multimedia Applications
Relevance To Conference: This paper primarily utilizes two modalities: images and human skeletons. Previous video-based 3D human mesh recovery methods relied solely on image features. However, low-resolution image features lack sufficient spatial information about the human body and contain noise unrelated to the human body (e.g., background and lighting), which limits performance. Therefore, we propose a novel framework for human mesh recovery from videos, called the semi-Analytical Regressor using disenTangled Skeletal representations (ARTS), which combines disentangled skeleton representations (i.e., joint position, human motion, and bone length) with image features, thereby bridging the gap between human pose estimation and video-based human mesh recovery. Compared to existing video-based methods, the proposed ARTS achieves state-of-the-art performance in both per-frame accuracy and temporal consistency on popular benchmarks. In addition, ARTS shows remarkable cross-domain generalization ability, which is attributed to its use of accurate and consistent skeleton data. These results highlight our model's potential for advancing multimedia human mesh recovery tasks that involve the fusion of skeleton and visual data.
Supplementary Material: zip
Submission Number: 725