Keywords: World Model, Human-Scene Interaction, Video Diffusion Model
TL;DR: A scene-action-conditioned video diffusion model that simulates embodied dexterous actions in a given static 3D scene
Abstract: Static 3D reconstruction has made it increasingly practical to build realistic digital twins of everyday environments, but these reconstructions remain largely non-interactive: they support navigation and view synthesis, yet cannot predict how dexterous actions change the world. We introduce Dexterous World Models (DWM), a scene-action-conditioned video diffusion framework that simulates egocentric visual dynamics induced by human hand manipulation in a known static 3D scene. Given a rendering of the static scene along a camera trajectory and an egocentric hand-mesh video encoding the action, DWM generates a temporally coherent interaction video while preserving unaltered regions of the scene. The model is initialized from a video inpainting diffusion prior, encouraging identity preservation for static content and generative residual modeling for action-induced changes. Because no large-scale real dataset provides aligned static-scene, hand-action, and interaction triplets under moving egocentric cameras, we train with a hybrid dataset that combines aligned synthetic interactions with fixed-camera real-world videos. Experiments show that DWM produces plausible object motion and articulation while preserving scene consistency, marking a step toward interactive digital twins driven by dexterous egocentric actions.
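The input/output contract described in the abstract can be illustrated with a minimal sketch. It only tracks tensor shapes, and it assumes the two conditioning videos are joined to the noisy latent by channel-wise concatenation, a common choice for inpainting-style diffusion models that the abstract does not confirm. The names `DWMConditioning` and `denoiser_input_shape` are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import Tuple

# (frames, height, width, channels)
Shape = Tuple[int, int, int, int]


@dataclass(frozen=True)
class DWMConditioning:
    """Hypothetical container for the two conditioning videos named in the abstract."""

    scene_render_shape: Shape  # static scene rendered along the camera trajectory
    hand_mesh_shape: Shape     # egocentric hand-mesh video encoding the action

    def validate(self) -> None:
        # The two videos must be aligned frame by frame and pixel by pixel.
        if self.scene_render_shape[:3] != self.hand_mesh_shape[:3]:
            raise ValueError("scene rendering and hand-mesh video are not aligned")


def denoiser_input_shape(noisy_latent: Shape, cond: DWMConditioning) -> Shape:
    """Shape of the denoiser input under ASSUMED channel-wise concatenation."""
    cond.validate()
    if noisy_latent[:3] != cond.scene_render_shape[:3]:
        raise ValueError("noisy latent and conditioning videos are not aligned")
    channels = (
        noisy_latent[3]
        + cond.scene_render_shape[3]
        + cond.hand_mesh_shape[3]
    )
    return noisy_latent[:3] + (channels,)


# Example with made-up sizes: 16 frames of 32x32 latents, 4 channels each.
cond = DWMConditioning(
    scene_render_shape=(16, 32, 32, 4),
    hand_mesh_shape=(16, 32, 32, 4),
)
input_shape = denoiser_input_shape((16, 32, 32, 4), cond)
```

The alignment checks reflect the requirement that the scene rendering and the hand-mesh video follow the same egocentric camera trajectory. The concatenation scheme and all sizes are illustrative only.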
Submission Number: 12