HSImul3R: Reconstructing Simulation-Ready Human-Scene-Interaction from Sparse Views

18 Sept 2025 (modified: 13 Nov 2025) | ICLR 2026 Conference Withdrawn Submission | CC BY 4.0
Keywords: 3D reconstruction, human-scene-interaction, simulation, embodied AI
TL;DR: A method for reconstructing simulation-ready human-scene-interactions from uncalibrated sparse-view inputs
Abstract: We present the first framework for simulation-ready 3D reconstruction of human–scene interactions (HSI) from sparse-view images. Prior approaches to 3D reconstruction are typically fragmented: they focus on either scene geometry or human motion and rarely model the interaction between the two. Recent attempts to reconstruct both jointly remain constrained by limited datasets or neglect the physical plausibility of interactions, so their outputs fail to remain stable when deployed in simulators, where stability is a critical requirement for embodied AI. To address these challenges, we propose **HSImul3R**, which makes three key contributions. First, we introduce **contact-aware interaction modeling**, which enforces realistic human-scene coupling within a unified 3D world coordinate system by aligning generative 3D priors with reconstructed geometry. Second, we propose **scene-targeted reinforcement learning**, which learns to stabilize interactions in simulation through dual supervision on motion fidelity and object proximity. Third, to further improve the stability of HSI simulation, we design **direct simulation reward optimization (DSRO)**, a reward-driven fine-tuning scheme that improves scene reconstructions by assessing their stability under both gravity and interactions. To support training and evaluation, we also collect **HSIBench**, a new dataset featuring diverse objects, human motions, and interaction scenarios. Extensive experiments demonstrate that HSImul3R achieves the first stable, simulation-ready HSI reconstructions and substantially outperforms existing methods.
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 10659
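
Illustrative note: the abstract describes the reinforcement-learning stage as using "dual supervision on motion fidelity and object proximity" but does not state the reward's form. The sketch below shows one common way such a two-term reward could be written. It is an assumption for illustration only, not the authors' formulation: the function name `dual_reward`, the exponential shaping, the weights, and the scale parameters are all hypothetical.

```python
import math


def dual_reward(sim_joints, ref_joints, contact_points, object_points,
                w_motion=0.5, w_prox=0.5, sigma_motion=0.1, sigma_prox=0.05):
    """Hypothetical two-term reward (not taken from the paper).

    sim_joints / ref_joints: lists of (x, y, z) joint positions in metres,
        from the simulated character and the reconstructed reference motion.
    contact_points: (x, y, z) body points expected to touch the object.
    object_points: (x, y, z) samples on the object surface.
    """
    # Motion-fidelity term: mean squared joint position error.
    motion_err = sum(
        math.dist(s, r) ** 2 for s, r in zip(sim_joints, ref_joints)
    ) / len(ref_joints)

    # Proximity term: mean distance from each contact point to its
    # nearest sampled object point.
    prox_err = sum(
        min(math.dist(c, o) for o in object_points) for c in contact_points
    ) / len(contact_points)

    # Map both errors into (0, 1] so the two terms are comparable.
    r_motion = math.exp(-motion_err / sigma_motion ** 2)
    r_prox = math.exp(-prox_err / sigma_prox)
    return w_motion * r_motion + w_prox * r_prox


if __name__ == "__main__":
    ref = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
    obj = [(0.0, 0.0, 0.0)]
    # Perfect tracking with the contact point on the object.
    print(dual_reward(ref, ref, [(0.0, 0.0, 0.0)], obj))
```

With the default weights summing to one, the reward peaks at 1 when the simulated pose matches the reference and every contact point lies on the object, and it decays as either the tracking error or the contact distance grows.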