Keywords: Controllable video generation, Egocentric video prediction, World model
Abstract: Recent advances in video diffusion models provide a strong basis for building world models with practical value. The next challenge is to investigate how an agent can leverage such a foundation model to understand, interact with, and plan within observed environments. This requires adding controllability to the model, turning it into a versatile game engine that can be manipulated and controlled dynamically. To this end, we investigate three key conditioning factors (camera, context frame, and text) and identify shortcomings in current model designs. Specifically, fusing the camera embedding with video features causes camera control to be influenced by those features. Meanwhile, injecting textual information compensates for unobserved spatiotemporal structures but also intrudes into the parts that have already been observed. To address these two issues, we propose the Spacetime Epipolar Attention Layer, which ensures that the egomotion generated by the model strictly adheres to the camera's movement. In addition, we inject the text and the context frame in a mutually exclusive manner to avoid the intrusion problem. Through extensive experiments, we demonstrate that our new model achieves unprecedented results on both the RealEstate and Epic Kitchen datasets, enabling free exploration and meaningful imagination based on observation.
Submission Number: 7