
Agent Real-to-Sim: Learning Interactive Behavior from Casual Videos

In Submission to NeurIPS 2024


TL;DR: We build a simulatable 3D agent in its familiar environment, given casual videos collected over a long time horizon (one month).

We aim to answer the following question: can we simulate the behavior of an agent by learning from casually-captured videos of the same agent, recorded over a long period of time (e.g., a month)? A) We first reconstruct the videos in 4D (3D and time), including the scene, the trajectory of the agent, and the trajectory of the observer (i.e., the camera held in the observer's hand). The individual 4D reconstructions are registered across time, yielding a complete 4D reconstruction. B) We then learn a representation of the agent that supports interactive behavior simulation. The behavior model explicitly reasons about goals, paths, and full-body movements, conditioned on the agent's ego-perception and past trajectory. This agent representation allows us to simulate novel scenarios through conditioning: for example, conditioned on different observer trajectories, the cat agent chooses to walk to the carpet, stay still while quivering its tail, or hide under the tray stand.
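To make part B) concrete, below is a minimal sketch of the hierarchical goal-path-body structure described above. All module names, feature sizes, and interfaces are hypothetical stand-ins for illustration, not the actual implementation.

```python
# Sketch of the hierarchical behavior model (goals -> paths -> body motion).
# All names and dimensions are hypothetical illustrations.
import torch
import torch.nn as nn

class StageMLP(nn.Module):
    """Stand-in for one stage: maps (noisy sample, condition) -> sample."""
    def __init__(self, sample_dim, cond_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(sample_dim + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, sample_dim),
        )
    def forward(self, noisy, cond):
        return self.net(torch.cat([noisy, cond], dim=-1))

# Conditioning: scene encoding (ego-perception), the agent's past
# trajectory, and the observer's trajectory (all hypothetical sizes).
scene_feat = torch.randn(1, 64)
past_traj  = torch.randn(1, 32)
observer   = torch.randn(1, 32)
cond = torch.cat([scene_feat, past_traj, observer], dim=-1)

goal_model = StageMLP(sample_dim=3, cond_dim=cond.shape[-1])                  # 3D goal
path_model = StageMLP(sample_dim=3 * 16, cond_dim=cond.shape[-1] + 3)        # 16 waypoints
body_model = StageMLP(sample_dim=24 * 16, cond_dim=cond.shape[-1] + 3 * 16)  # pose per step

# Coarse-to-fine: each stage conditions on the output of the previous one.
goal = goal_model(torch.randn(1, 3), cond)
path = path_model(torch.randn(1, 3 * 16), torch.cat([cond, goal], dim=-1))
body = body_model(torch.randn(1, 24 * 16), torch.cat([cond, path], dim=-1))
```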

Results: 4D Reconstruction

We show video results corresponding to Fig. 2. Left: reconstructions from the camera viewpoint; right: reconstructions of the environment, agent, and user camera (shown as the moving coordinate frame) from a top-down viewpoint. You may find more results on the cat dataset [here] (26 videos).

We show video results of reconstructing a bunny, dog, and human agent. You may find more results on the [bunny] dataset, the [dog] dataset, and the [human] dataset.

Results: Behavior Simulation

User conditioning: We can simulate the behavior of an agent through the proxy of the user's location (represented by the coordinate axes).
Goal conditioning: We can control the motion of an agent by manually setting goals (represented by the blue spheres).
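Conceptually, the two modes differ only in which signal is supplied to the model. A minimal sketch, where the function and its arguments are hypothetical rather than the actual interface:

```python
# Hypothetical sketch contrasting the two conditioning modes.
import torch

def simulate(cond, goal=None, user_trajectory=None):
    """Append optional signals to the base condition (stand-in for the model).

    If a goal is given, the goal stage is bypassed; if a user trajectory
    is given, the goal is generated reactively from it.
    """
    extras = [c for c in (goal, user_trajectory) if c is not None]
    return torch.cat([cond] + extras, dim=-1)

cond = torch.randn(1, 96)  # scene + past-trajectory encoding (hypothetical)

# User conditioning: the agent reacts to the observer's location.
cond_user = simulate(cond, user_trajectory=torch.randn(1, 32))

# Goal conditioning: a goal position (blue sphere) is set manually.
cond_goal = simulate(cond, goal=torch.tensor([[1.0, 0.0, 0.5]]))
```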


Auto-regressive generation: We can simulate the behavior of the agent over a long time horizon (more than 30 s, despite being trained on 5.6 s clips) by conditioning on the environment and the past trajectory. Auto-regressive generation results on all agents can be found [here].
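The rollout loop below illustrates how short predicted clips are chained past the training horizon; window lengths, the frame rate, and the predict() stand-in are assumptions for illustration only.

```python
# Sketch of autoregressive rollout beyond the training horizon.
import torch

CLIP_LEN = 56   # e.g., 5.6 s at a hypothetical 10 Hz (training window)
HORIZON  = 300  # e.g., 30 s rollout

def predict(scene_feat, past_window):
    """Stand-in for one model call: predicts the next CLIP_LEN states."""
    return past_window[:, -CLIP_LEN:] + 0.01 * torch.randn(1, CLIP_LEN, 3)

scene_feat = torch.randn(1, 64)    # hypothetical scene encoding
traj = torch.zeros(1, CLIP_LEN, 3)  # seed trajectory

while traj.shape[1] < HORIZON:
    next_clip = predict(scene_feat, traj)       # condition on environment + past
    traj = torch.cat([traj, next_clip], dim=1)  # feed predictions back in
```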

Visualizations: Hierarchical Motion Denoising (Fig. 3)

Goal denoising (w/ different conditioning signals)

Scenario: Exploring a room
Conditioned on environment.

Conditioned on environment and past trajectory.
Conditioned on environment, past trajectory, and user trajectory.
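A toy version of how these denoising trajectories arise is sketched below: the same reverse loop is run with different subsets of the conditioning vector filled in. The schedule, step count, and denoiser are placeholders, not the actual model.

```python
# Toy reverse-denoising loop for goals under different conditioning sets.
import torch
import torch.nn as nn

STEPS = 50
denoiser = nn.Linear(3 + 96 + 1, 3)  # (noisy goal, condition, timestep) -> clean goal

def denoise_goal(cond):
    g = torch.randn(1, 3)  # start from pure noise
    for t in range(STEPS, 0, -1):
        t_embed = torch.full((1, 1), t / STEPS)
        g_hat = denoiser(torch.cat([g, cond, t_embed], dim=-1))  # predict clean goal
        g = g_hat + (t / STEPS) * 0.1 * torch.randn_like(g)      # re-noise slightly
    return g

env, past, user = torch.randn(1, 32), torch.randn(1, 32), torch.randn(1, 32)
zeros = torch.zeros(1, 32)  # mask out unused signals

goal_env      = denoise_goal(torch.cat([env, zeros, zeros], dim=-1))
goal_env_past = denoise_goal(torch.cat([env, past, zeros], dim=-1))
goal_full     = denoise_goal(torch.cat([env, past, user], dim=-1))
```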

Path denoising (w/ different conditioning signals)

Scenario: Jumping off the sofa
No environment conditioning.
Environment conditioning.

Body motion denoising

Scenario: Following a path