Abstract: Estimating 3D human pose and shape from monocular video is an ill-posed problem due to depth ambiguity, yet most existing methods overlook the multiple plausible motion hypotheses that this ambiguity admits. To address this, we propose a multi-candidate motion pose and shape network (MMPS-Net), designed to generate temporal representations of multiple plausible motion candidates and adaptively fuse them for 3D human pose and shape estimation. Specifically, we first propose a multi-candidate motion continuity attention (MMoCA) module to generate multiple kinematically compliant motion candidates. Second, we introduce a multi-candidate cross-attention (MCA) module that enables information passing among candidates to strengthen their relevance. Third, we develop a multi-candidate hierarchical attentive feature integration (MHAFI) module that refines the target frame's feature representation by capturing temporal correlations within each motion candidate and adaptively integrating all candidates. By coupling these designs, MMPS-Net surpasses existing video-based methods on the 3DPW, MPI-INF-3DHP, and Human3.6M benchmarks.
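To make the cross-candidate information passing concrete, the following is a minimal NumPy sketch of a cross-attention step in the spirit of the MCA module: each of K candidate motion sequences attends over the frames of all candidates, so features can flow between hypotheses. All names, shapes, and the single-head scaled dot-product form are illustrative assumptions, not the paper's actual implementation (which would include learned projections, multiple heads, and residual connections).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def candidate_cross_attention(candidates):
    """Hypothetical single-head cross-attention among motion candidates.

    candidates: array of shape (K, T, D) — K motion candidates,
    T frames, D-dimensional features per frame.
    Returns an array of the same shape where each candidate's frame
    features are re-expressed as attention-weighted mixtures of the
    frame features of ALL candidates.
    """
    K, T, D = candidates.shape
    kv = candidates.reshape(K * T, D)          # keys/values: every frame of every candidate
    out = np.empty_like(candidates)
    for k in range(K):
        q = candidates[k]                      # (T, D) queries: one candidate's frames
        attn = softmax(q @ kv.T / np.sqrt(D))  # (T, K*T) attention over all candidates
        out[k] = attn @ kv                     # mix information across candidates
    return out
```

In a learned version, `q`, `kv`, and the output would pass through trainable linear projections; this sketch only shows why the operation couples the candidates, which is the property the MCA module is described as providing.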