Enhancing structural consistency of 3D Human Pose Estimation through Trainable Loss Function

20 Sept 2025 (modified: 11 Feb 2026), Submitted to ICLR 2026, CC BY 4.0
Keywords: Trainable Loss function, Structural Consistency, 3D Human Pose Estimation, Structured Energy network
Abstract: 3D human pose estimation (3D HPE) is challenging because human poses obey complex structural constraints that standard training objectives such as mean squared error (MSE) do not capture well. Previous studies have tried to enforce structural consistency through manually designed priors, rule-based constraints, or specialized architectures, which often limit adaptability. In this paper, we propose SCoTL-pose (Structural Consistency via Trainable Loss for Pose Estimation), a framework that enables a pose estimation model (pose-net) to learn structural dependencies directly from data through a trainable loss function (loss-net), without explicit priors. Although inspired by Structured Energy As Loss (SEAL), SCoTL-pose extends that idea to 3D human pose estimation, a task with more complex and higher-dimensional structural dependencies than those considered in previous applications. To this end, we use a graph-based model as the loss-net architecture, tailored to capture both local and global dependencies among joints and to encourage anatomically plausible pose predictions. SCoTL-pose can be combined with diverse backbones, from single-frame lifting networks to state-of-the-art multi-frame temporal models, without additional inference cost. To assess quantitatively whether SCoTL-pose improves structural plausibility, we also introduce Limb Symmetry Error (LSE) and Body Segment Length Error (BSLE) as evaluation metrics. Experimental results on the Human3.6M, MPI-INF-3DHP, and Human3.6M WholeBody datasets show that SCoTL-pose not only reduces per-joint pose estimation errors but also generates more plausible poses, with larger gains in more challenging settings such as single-frame or in-the-wild scenarios.
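The abstract names LSE and BSLE but does not define them. As a rough illustration only, the sketch below shows one plausible reading: LSE as the mean absolute length difference between paired left and right limbs of a predicted pose, and BSLE as the mean absolute difference between predicted and ground-truth bone lengths. The joint indices, the limb pairing, and both formulas are assumptions made for this example, not the paper's definitions.

```python
import numpy as np

# Hypothetical 17-joint skeleton: these (parent, child) index pairs are
# illustrative assumptions, not taken from the paper.
LEFT_LIMBS = [(4, 5), (5, 6), (11, 12), (12, 13)]
RIGHT_LIMBS = [(1, 2), (2, 3), (14, 15), (15, 16)]


def bone_lengths(pose, bones):
    """Euclidean length of each bone; pose has shape (J, 3)."""
    idx = np.asarray(bones)
    return np.linalg.norm(pose[idx[:, 0]] - pose[idx[:, 1]], axis=-1)


def limb_symmetry_error(pred):
    """Assumed LSE: mean |left length - right length| over paired limbs."""
    left = bone_lengths(pred, LEFT_LIMBS)
    right = bone_lengths(pred, RIGHT_LIMBS)
    return float(np.mean(np.abs(left - right)))


def body_segment_length_error(pred, gt, bones):
    """Assumed BSLE: mean |predicted length - ground-truth length| over bones."""
    return float(np.mean(np.abs(bone_lengths(pred, bones) - bone_lengths(gt, bones))))
```

Under these assumed definitions, a prediction whose left and right limbs are mirror images has an LSE of zero, and a prediction identical to the ground truth has a BSLE of zero, so both quantities measure structural plausibility independently of per-joint position error.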
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 23380