Keywords: world models, human-scene interaction, dexterous manipulation, digital twin
TL;DR: A scene-action-conditioned video diffusion model that simulates embodied dexterous actions in a given static 3D scene
Abstract: Recent progress in 3D reconstruction has made it easy to create realistic digital twins of everyday environments. However, current digital twins remain largely static, limited to navigation and view synthesis without embodied interactivity. To bridge this gap, we introduce Dexterous World Model (DWM), a scene-action-conditioned video diffusion framework that models how dexterous human actions induce dynamic changes in static 3D scenes.
Given a static 3D scene rendering and an egocentric hand motion sequence, DWM generates temporally coherent videos depicting plausible human-scene interactions. Our approach conditions video generation on (1) static scene renderings following a specified camera trajectory to ensure spatial consistency, and (2) egocentric hand mesh renderings that encode both geometry and motion cues in the egocentric view to model action-conditioned dynamics directly. To train DWM, we construct a hybrid interaction video dataset: synthetic egocentric interactions provide fully aligned supervision for joint locomotion-manipulation learning, while fixed-camera real-world videos contribute diverse and realistic object dynamics.
Experiments demonstrate that DWM enables realistic, physically plausible interactions, such as grasping, opening, or moving objects, while maintaining camera and scene consistency. This framework marks a first step toward video diffusion-based interactive digital twins, enabling embodied simulation from egocentric actions.
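As an illustration of the two conditioning signals described above, the sketch below shows one common way per-frame conditions can be fed to a video diffusion denoiser: concatenating them with the noisy latents along the channel axis. The abstract does not state how DWM injects its conditions, so the concatenation scheme, the function name `build_conditioned_input`, and all tensor shapes here are assumptions made for illustration only, not the authors' implementation.

```python
import numpy as np


def build_conditioned_input(noisy_latents, scene_latents, hand_latents):
    """Stack per-frame conditions onto the noisy latents along channels.

    All inputs are assumed to have shape (T, C, H, W): T frames, C channels.
    Hypothetical helper; channel concatenation is an assumed design choice.
    """
    frames = {a.shape[0] for a in (noisy_latents, scene_latents, hand_latents)}
    if len(frames) != 1:
        raise ValueError("all inputs must have the same number of frames")
    sizes = {a.shape[2:] for a in (noisy_latents, scene_latents, hand_latents)}
    if len(sizes) != 1:
        raise ValueError("all inputs must have the same spatial size")
    # Channel order: [noisy video | static scene rendering | hand mesh rendering]
    return np.concatenate([noisy_latents, scene_latents, hand_latents], axis=1)


rng = np.random.default_rng(0)
T, C, H, W = 8, 4, 16, 16  # illustrative sizes, not taken from the paper
noisy = rng.standard_normal((T, C, H, W))
scene = rng.standard_normal((T, C, H, W))  # stands in for scene renderings
hand = rng.standard_normal((T, C, H, W))   # stands in for hand mesh renderings

x = build_conditioned_input(noisy, scene, hand)
print(x.shape)
```

Under this assumed scheme, the scene channels carry the camera-consistent static background for every frame, while the hand channels carry the action signal, so a denoiser would see both conditions aligned pixel-wise with the frame it is generating.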
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 11