3D Human Pose Estimation from Multiple Dynamic Views via Single-view Pretraining with Procrustes Alignment
Abstract: 3D human pose estimation from multiple cameras with unknown calibration has received comparatively little attention. The few existing data-driven solutions do not fully exploit the 3D training data that are already available, and they typically train from scratch for every novel multi-view scene, which impedes both accuracy and efficiency. We show how to exploit 3D training data to the fullest and associate multiple dynamic views efficiently to achieve high precision on novel scenes using a simple yet effective framework, dubbed \textit{Multiple Dynamic View Pose estimation} (MDVPose). MDVPose uses data from the novel scene to finetune a single-view pretrained motion encoder in a multi-view setting, aligns an arbitrary number of views in a unified coordinate system via Procrustes alignment, and imposes multi-view consistency. The proposed method achieves 22.1 mm P-MPJPE and 34.2 mm MPJPE on the challenging in-the-wild Ski-Pose PTZ dataset, improving on the state-of-the-art method by 24.8% in P-MPJPE (-7.3 mm) and 19.0% in MPJPE (-8.0 mm). It also outperforms the state-of-the-art methods by a large margin (-18.2 mm P-MPJPE and -28.3 mm MPJPE) on the EgoBody dataset. In addition, MDVPose achieves robust performance on the Human3.6M dataset, which features multiple static cameras. Code will be released upon acceptance.
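For readers unfamiliar with the Procrustes alignment and the MPJPE / P-MPJPE metrics mentioned in the abstract, the following is a minimal sketch of the standard similarity (scale, rotation, translation) Procrustes alignment of one set of 3D joints onto another, together with both metrics. It is a generic illustration written with NumPy, not the MDVPose implementation: the function names `procrustes_align`, `mpjpe`, and `p_mpjpe` are hypothetical, and how MDVPose applies the alignment across views is not specified in the abstract.

```python
import numpy as np


def procrustes_align(pred, gt):
    """Similarity-align pred (J, 3) onto gt (J, 3) in the least-squares sense.

    Generic Kabsch/Umeyama-style sketch; not taken from the MDVPose code.
    """
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g            # center both point sets
    U, S, Vt = np.linalg.svd(p.T @ g)        # SVD of the 3x3 cross-covariance
    # Flip the last singular direction if needed so R is a proper rotation
    # (determinant +1) rather than a reflection.
    d = np.sign(np.linalg.det(U) * np.linalg.det(Vt))
    D = np.diag([1.0, 1.0, d])
    R = U @ D @ Vt                           # rotation for row-vector points
    s = (S * np.array([1.0, 1.0, d])).sum() / (p ** 2).sum()  # optimal scale
    return s * p @ R + mu_g


def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance over joints."""
    return np.linalg.norm(pred - gt, axis=-1).mean()


def p_mpjpe(pred, gt):
    """MPJPE measured after Procrustes-aligning the prediction to the target."""
    return mpjpe(procrustes_align(pred, gt), gt)
```

Because the alignment removes global scale, rotation, and translation, P-MPJPE is never larger than MPJPE for the same prediction, which is consistent with the pairs of numbers reported above.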
Primary Subject Area: [Experience] Multimedia Applications
Secondary Subject Area: [Content] Media Interpretation
Relevance To Conference: 3D human pose estimation (HPE), which estimates the 3D coordinates of human body joints, is a fundamental technique underlying many multimedia tasks such as human behavior analysis and multimedia interpretation and retrieval. It is widely used in related fields, including video surveillance, human-computer interaction, and virtual reality.
Supplementary Material: zip
Submission Number: 2106