Keywords: Gaussian Splatting, 4D Human-Scene Reconstruction, Avatar Reconstruction, Novel-view Synthesis
Abstract: Accurately capturing dynamic humans as they interact with their 3D environment from a single camera is a pivotal goal for applications ranging from assistive robotics to AR. However, current monocular approaches fall short, as they are typically restricted to reconstructing either the person or the static background in isolation. Methods that capture both often rely on cumbersome multi-view setups, limiting their real-world applicability. To address this, we propose a novel framework that reconstructs a complete 4D human-scene representation from monocular video. We formulate the task as an ill-posed inverse problem and introduce a robust regularization strategy that leverages two complementary priors: a static 3D Gaussian Splatting representation of the scene, and an animatable, SMPL-based 3D Gaussian avatar of the human. Our method jointly optimizes the camera pose, the human motion, and the parameters of both priors to faithfully reconstruct time-varying geometry, appearance, and physically plausible human-scene interactions. We validate our approach on a self-collected dataset featuring synchronized human-acting videos together with separate human and scene scan videos. Our results demonstrate state-of-the-art performance, achieving an average PSNR of 23 dB on challenging novel views and surpassing existing monocular baselines.
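To make the joint optimization described in the abstract concrete, the sketch below shows one plausible structure for it: a static set of scene Gaussians and a canonical-space avatar Gaussian set are composed per frame, the avatar is animated by per-frame SMPL pose parameters, and a shared photometric loss drives all parameters at once. This is a minimal illustrative sketch, not the authors' implementation; `lbs_deform` and `render_gaussians` are hypothetical placeholders (a real system would use SMPL linear blend skinning and a tile-based 3DGS rasterizer), and all sizes and data are toy stand-ins.

```python
# Minimal sketch (NOT the paper's code): joint optimization of a static scene
# 3DGS prior and an SMPL-driven avatar prior against monocular frames.
import torch

T, N_scene, N_avatar = 8, 1024, 512            # frames, Gaussian counts (toy sizes)

# Static scene prior: per-Gaussian mean, color, opacity (all learnable).
scene = {
    "mu":    torch.randn(N_scene, 3, requires_grad=True),
    "rgb":   torch.rand(N_scene, 3).requires_grad_(),
    "alpha": torch.rand(N_scene, 1).requires_grad_(),
}
# Avatar prior: Gaussians in SMPL canonical space, plus per-frame pose params.
avatar = {
    "mu_c":  torch.randn(N_avatar, 3, requires_grad=True),
    "rgb":   torch.rand(N_avatar, 3).requires_grad_(),
    "alpha": torch.rand(N_avatar, 1).requires_grad_(),
}
poses = torch.zeros(T, 72, requires_grad=True)  # SMPL body pose per frame
cam_t = torch.zeros(T, 3,  requires_grad=True)  # per-frame camera translation

def lbs_deform(mu_c, pose):
    """Placeholder for SMPL linear blend skinning of canonical Gaussians."""
    return mu_c + 0.01 * pose[:3]               # toy stand-in for articulation

def render_gaussians(mu, rgb, alpha, cam):
    """Placeholder differentiable splatting; a real system would rasterize
    the Gaussians into an image here."""
    w = alpha / alpha.sum()
    return (w * rgb).sum(0) + 0.0 * cam.sum()   # toy per-frame image statistic

frames = torch.rand(T, 3)                       # stand-in photometric targets
opt = torch.optim.Adam(
    list(scene.values()) + list(avatar.values()) + [poses, cam_t], lr=1e-2)

for step in range(100):
    loss = 0.0
    for t in range(T):
        mu_h = lbs_deform(avatar["mu_c"], poses[t])      # animate the avatar
        mu   = torch.cat([scene["mu"], mu_h])            # compose scene + human
        rgb  = torch.cat([scene["rgb"], avatar["rgb"]])
        a    = torch.cat([scene["alpha"], avatar["alpha"]])
        pred = render_gaussians(mu, rgb, a, cam_t[t])
        loss = loss + ((pred - frames[t]) ** 2).mean()   # photometric term
    opt.zero_grad(); loss.backward(); opt.step()
```

The key design point the sketch captures is that the two priors regularize each other through a single loss: gradients from every frame flow simultaneously into the static scene Gaussians, the canonical avatar, the per-frame SMPL poses, and the camera, which is what makes the otherwise ill-posed monocular problem tractable.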
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 392