Semantics-aware Test-time Adaptation for 3D Human Pose Estimation

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
Abstract: This work highlights a semantics misalignment in 3D human pose estimation. In test-time adaptation, the misalignment manifests as overly smoothed and unguided predictions: the smoothing settles predictions towards an average pose, and under occlusions or truncations the adaptation becomes fully unguided. To address this, we pioneer the integration of a semantics-aware motion prior into the test-time adaptation of 3D pose estimation. We leverage video understanding and a well-structured motion-text space to adapt the model's motion predictions to adhere to the video's semantics at test time. Additionally, we incorporate a missing-2D-pose completion step based on motion-text similarity, which strengthens the motion prior's guidance under occlusions and truncations. Our method significantly improves state-of-the-art 3D human pose estimation TTA techniques, with a more than 12% decrease in PA-MPJPE on the 3DPW and 3DHP datasets.
Lay Summary: When computers try to estimate how people move in 3D from a single-view video, they often struggle—especially when parts of the person are hidden. This leads to unrealistic or static poses that don’t match the expected activity, a problem we refer to as semantic misalignment. Our research addresses this by helping the computer understand the human activity, such as walking or climbing stairs, while it makes predictions. We use ChatGPT to identify the activity from the video, then guide the motion predictions to align with that activity through a shared motion-language space. We also complete missing body parts in a way that matches the intended action. Our research significantly improves prediction quality, reducing error by more than 12% on major 3D human pose datasets.
Primary Area: Applications->Computer Vision
Keywords: Test-time Adaptation, 3D Human Pose Estimation
Submission Number: 15096