Agent-to-Sim: Learning Interactive Behavior from Casual Videos

10 May 2024 (modified: 06 Nov 2024) · Submitted to NeurIPS 2024 · CC BY 4.0
Keywords: Embodied Agents, Animal Reconstruction and Tracking, Motion Generation, 4D Reconstruction, Scene Reconstruction
Abstract: Agent behavior simulation empowers robotics, gaming, movies, and VR applications, but building such simulators often requires the laborious effort of manually crafting the agent's decision process and motion patterns. Recent advances in visual tracking and motion capture enable learning of agent behavior from real-world data, but these methods are limited to a few scenarios due to their dependence on specialized sensors (e.g., synchronized multi-camera systems). Towards better scalability, we present a framework, Agent-to-Sim, that learns simulatable 3D agents in a 3D environment from monocular videos. To deal with partial views, our framework fuses observations into a canonical space for both the agent and the scene, resulting in a dense 4D spatiotemporal reconstruction. We then learn an interactive behavior generator by querying paired data of the agent's perception and actions from the 4D reconstruction. Agent-to-Sim enables real-to-sim transfer of agents in their familiar environments given longitudinal video recordings captured with a smartphone over a month. We show results on pets (e.g., cat, dog, bunny) and a person, and analyse how the observer's motion and the 3D scene affect an agent's behavior.
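The abstract's second stage can be pictured as supervised learning over perception-action pairs queried from the 4D reconstruction. The sketch below is a minimal illustration under stated assumptions, not the paper's implementation: `BehaviorGenerator`, `sample_pairs`, and all dimensions are hypothetical placeholders.

```python
# Illustrative sketch only: class/function names and dimensions are
# hypothetical and do not come from the paper's code release.
import torch
import torch.nn as nn

class BehaviorGenerator(nn.Module):
    """Toy conditional model: maps the agent's perception (an egocentric
    encoding of the scene and observer) to a next-step action/motion."""
    def __init__(self, perception_dim=256, action_dim=6, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(perception_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, perception):
        return self.net(perception)

def train_from_4d(reconstruction, model, steps=1000, lr=1e-4):
    """Fit the generator on (perception, action) pairs queried from a 4D
    reconstruction; `reconstruction.sample_pairs` is a stand-in for
    however such pairs would actually be extracted."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        perception, action = reconstruction.sample_pairs(batch_size=64)
        loss = nn.functional.mse_loss(model(perception), action)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```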
Supplementary Material: zip
Primary Area: Machine vision
Submission Number: 3583