Agent-to-Sim: Learning Interactive Behavior Models from Casual Longitudinal Videos

In Submission to ICLR 2025

Anonymous Authors

TL;DR: Given monocular videos collected across a long time horizon (e.g., 1 month), we build interactive behavior models of an agent grounded in a 3D environment.

Method Overview


Agent-to-Sim learns a behavior simulator in three steps. First, it registers the agent and the scene to a canonical 3D space. Second, it builds a complete and persistent 4D spacetime reconstruction that contains the agent, the scene, and the observer. Finally, it learns a predictive model of agent behavior by querying the agent's perception and motion data from the 4D reconstruction.

Results: 4D Reconstruction


Left: reconstructions from the camera view; Right: reconstructions of the environment, the agent, and the observer from a bird's-eye view. Full results on each video collection: [cat], [human], [bunny], [dog].

Results: Offline Behavior Generation

We use the 4D reconstruction as training data to learn an agent behavior simulator. Below we show the offline-generated behavior of a cat agent in the 3D environment. Left: bird's-eye view; Right: third-person view.

Online Behavior Simulation

Environment awareness. We can generate diverse environment-aware motion from the same initial state.

Observer awareness. Given different observer motion (red triangles), the cat agent moves differently.

Long sequence generation. We can generate agent behavior over a long time horizon by conditioning on the environment and the past trajectory.

User control. We can also control the motion of an agent by manually setting the goal (the blue sphere).
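
The long sequence generation described above can be pictured as chaining short segments, each conditioned on the environment and the trajectory generated so far. The following is a minimal sketch of that rollout loop only, not the authors' implementation: `generate_segment`, `rollout`, `env_code`, and the segment length are hypothetical names and values, and the learned generator is replaced by a random walk that ignores the environment.

```python
import numpy as np

def generate_segment(past, env_code, rng, seg_len=8, step=0.1):
    """Hypothetical stand-in for the learned generator.

    A real model would use `env_code` and `past` as conditioning signals.
    This toy version ignores `env_code` and returns a random walk that
    continues from the last point of the past trajectory.
    """
    start = past[-1]
    steps = rng.normal(scale=step, size=(seg_len, 3))
    return start + np.cumsum(steps, axis=0)

def rollout(init_state, env_code, num_segments=5, seed=0):
    """Chain segments, each conditioned on the trajectory so far."""
    rng = np.random.default_rng(seed)
    traj = np.asarray(init_state, dtype=float).reshape(1, 3)
    for _ in range(num_segments):
        seg = generate_segment(traj, env_code, rng)
        traj = np.concatenate([traj, seg], axis=0)
    return traj

# One initial point plus 5 segments of 8 points each.
traj = rollout([0.0, 0.0, 0.0], env_code=None)
print(traj.shape)  # (41, 3)
```

Because each segment starts from the end of the previous one, the horizon can be extended by raising `num_segments` without changing the per-segment generator.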

Ablation: Conditioning Signals (Fig. 4)

Environment code. Removing the environment code produces a trajectory that penetrates the wall.

Past code. Removing the past code introduces sudden jumps between adjacent trajectory segments.

Visualizations: Hierarchical Motion Denoising (Fig. 2)

Goal denoising (w/ different conditioning signals)

Scenario: Exploring a room
Conditioned on environment.

Conditioned on environment and past trajectory.

Conditioned on environment, past trajectory, and user trajectory.

Path denoising (w/ different conditioning signals)

Scenario: Jumping off the sofa
No environment conditioning.

Environment conditioning.

Body motion denoising

Scenario: Following a path
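
The three visualizations above (goal, path, body motion) suggest a coarse-to-fine hierarchy in which each stage is denoised from noise and conditioned on the output of the previous stage. The sketch below illustrates only that staging, under strong assumptions: `denoise` is a toy stand-in that pulls a noisy sample toward a known target rather than a trained network, and `sample_hierarchy`, the straight-line path target, and the joint offsets are hypothetical choices made for illustration.

```python
import numpy as np

def denoise(x, target, num_steps=50, rate=0.2):
    """Toy stand-in for a learned denoiser (assumption, not a trained model).

    Each step moves the noisy sample a fixed fraction of the way toward
    a known clean target, so the noise shrinks geometrically.
    """
    for _ in range(num_steps):
        x = x + rate * (target - x)
    return x

def sample_hierarchy(start, goal_target, seed=0, path_len=10, num_joints=4):
    rng = np.random.default_rng(seed)
    start = np.asarray(start, dtype=float)
    goal_target = np.asarray(goal_target, dtype=float)

    # Stage 1: goal, denoised from pure noise.
    goal = denoise(rng.normal(size=3), goal_target)

    # Stage 2: path, conditioned on the goal. The clean target here is a
    # straight line from start to goal (hypothetical choice).
    t = np.linspace(0.0, 1.0, path_len)[:, None]
    line = (1.0 - t) * start + t * goal
    path = denoise(rng.normal(size=(path_len, 3)), line)

    # Stage 3: body motion, conditioned on the path. Joints sit at fixed
    # vertical offsets above each path point (hypothetical choice).
    offsets = 0.1 * np.arange(num_joints)[:, None] * np.array([0.0, 0.0, 1.0])
    body_target = path[:, None, :] + offsets[None, :, :]
    body = denoise(rng.normal(size=(path_len, num_joints, 3)), body_target)
    return goal, path, body

goal, path, body = sample_hierarchy(start=[0.0, 0.0, 0.0], goal_target=[1.0, 2.0, 0.0])
print(goal.shape, path.shape, body.shape)
```

In a trained system the target would not be known in advance; the denoiser would instead predict it from the noisy sample and the conditioning signals (environment, past trajectory, user input), which this toy version omits.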