Abstract: In this work, we explore egocentric whole-body motion capture using a single fisheye camera, which simultaneously estimates human body and hand motion. This task presents significant challenges due to three factors: the lack of high-quality datasets, fisheye camera distortion, and human body self-occlusion. To address these challenges, we propose a novel approach that leverages FisheyeViT to extract fisheye image features, which are subsequently converted into pixel-aligned 3D heatmap representations for 3D human body pose prediction. For hand tracking, we incorporate dedicated hand detection and hand pose estimation networks for regressing 3D hand poses. Finally, we develop a diffusion-based whole-body motion prior model to refine the estimated whole-body motion while accounting for joint uncertainties. To train these networks, we collect a large synthetic dataset, EgoWholeBody, comprising 840,000 high-quality egocentric images captured across a diverse range of whole-body motion sequences. Quantitative and qualitative evaluations demonstrate the effectiveness of our method in producing high-quality whole-body motion estimates from a single egocentric camera.