CRISP: Contact-guided Real2Sim from Monocular Video with Planar Scene Primitives

Published: 19 Sept 2025 · Last Modified: 19 Sept 2025 · NeurIPS 2025 Workshop EWM · CC BY 4.0
Keywords: Human–Scene Interaction, 4D human motion reconstruction, Physics-based simulation for control
Abstract: Modeling contact-accurate human–scene interaction from monocular video is a crucial step toward real-to-sim transfer in computer vision and robotics. The task remains highly challenging, however, due to the inherent ambiguities of monocular perception and the limitations of Human Mesh Recovery (HMR) and single-view 3D geometry estimation. Existing methods often fail to capture reliable contact and scene structure, making them unsuitable for converting in-the-wild videos into simulation-ready assets. In this work, we introduce CRISP, a framework that integrates HMR, 4D reconstruction, and contact prediction into a unified front-end for recovering human motion, scene structure, and contact cues. These signals jointly guide the completion of occluded geometry, after which we fit compact planar primitives that merge the scene point cloud and the contact point cloud into a unified, simulation-friendly representation. Finally, we import the reconstructed assets into a physics-based simulator and use reinforcement learning to enforce realistic, contact-faithful human–scene interaction. Our approach achieves over a 97% success rate on human-centric video benchmarks (EMDB, PROX) and delivers ~1.9× faster reinforcement-learning training throughput than prior pipelines. This demonstrates the ability of CRISP to generate paired human motion and interacting environments at scale, greatly advancing real-to-sim applications in robotics and embodied AI.
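The core geometric step, fitting compact planar primitives to the merged scene and contact point clouds, can be illustrated with a minimal sketch. The snippet below is not the paper's implementation; it assumes a plain least-squares plane fit via SVD (the function name `fit_plane` and all parameters are ours for illustration), whereas a full pipeline would likely add robust plane segmentation such as RANSAC over the merged clouds.

```python
import numpy as np

def fit_plane(points: np.ndarray):
    """Least-squares plane fit to an (N, 3) point cloud.

    Returns (centroid, unit_normal); the plane is the set of x with
    unit_normal @ (x - centroid) == 0. Illustrative only; CRISP's
    actual primitive-fitting procedure is not specified here.
    """
    centroid = points.mean(axis=0)
    # The right singular vector with the smallest singular value of the
    # centered cloud is the direction of least variance, i.e. the normal.
    _, _, vt = np.linalg.svd(points - centroid, full_matrices=False)
    return centroid, vt[-1]

# Usage: noisy samples from the plane z = 0.1x + 0.2y + 1.
rng = np.random.default_rng(0)
xy = rng.uniform(-1.0, 1.0, size=(500, 2))
z = 0.1 * xy[:, 0] + 0.2 * xy[:, 1] + 1.0 + 0.01 * rng.standard_normal(500)
cloud = np.column_stack([xy, z])
c, n = fit_plane(cloud)
print("centroid:", c, "normal:", n / np.sign(n[2]))
```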
Submission Number: 69