Abstract: Accurate and flexible world models are crucial for autonomous systems to understand their environment and predict future events. Object-centric models, with structured latent spaces, have shown promise in modeling object dynamics and interactions, but often face challenges in scaling to complex datasets and incorporating external guidance, limiting their applicability in robotics. To address these limitations, we propose TextOCVP, an object-centric model for image-to-video generation guided by textual descriptions. TextOCVP parses an observed scene into object representations, called slots, and utilizes a text-conditioned transformer predictor to forecast future object states and video frames. Our approach jointly models object dynamics and interactions while incorporating textual guidance, thus leading to accurate and controllable predictions. Our method's structured latent space offers enhanced control over the prediction process, outperforming several image-to-video generative baselines. Additionally, we demonstrate that structured object-centric representations provide superior controllability and interpretability, facilitating the modeling of object dynamics and enabling more precise and understandable predictions. Videos and code are available at https://play-slot.github.io/TextOCVP/.
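To make the described pipeline concrete, the following minimal sketch illustrates the general idea of a text-conditioned, object-centric predictor: slot states self-attend to model object interactions and cross-attend to caption token embeddings for textual guidance before a residual next-step prediction. All module names, layer choices, and dimensions here are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch only: a text-conditioned slot predictor in the spirit of
# the abstract. Names, dimensions, and layer choices are assumptions.
import torch
import torch.nn as nn


class TextConditionedSlotPredictor(nn.Module):
    """Predicts next-step slot states from current slots and text embeddings."""

    def __init__(self, slot_dim=64, text_dim=64, num_heads=4, num_layers=2):
        super().__init__()
        # Self-attention layers model slot-slot (object-object) interactions.
        self.self_attn_layers = nn.ModuleList(
            nn.TransformerEncoderLayer(
                d_model=slot_dim, nhead=num_heads, batch_first=True
            )
            for _ in range(num_layers)
        )
        # Cross-attention lets each slot attend to the caption's token embeddings.
        self.cross_attn = nn.MultiheadAttention(
            embed_dim=slot_dim, kdim=text_dim, vdim=text_dim,
            num_heads=num_heads, batch_first=True,
        )
        self.out = nn.Linear(slot_dim, slot_dim)

    def forward(self, slots, text_tokens):
        # slots:       (batch, num_slots, slot_dim)   current object states
        # text_tokens: (batch, num_tokens, text_dim)  encoded description
        x = slots
        for layer in self.self_attn_layers:
            x = layer(x)                        # object dynamics and interactions
        attended, _ = self.cross_attn(x, text_tokens, text_tokens)
        x = x + attended                        # inject textual guidance
        return slots + self.out(x)              # residual next-step slot prediction


if __name__ == "__main__":
    predictor = TextConditionedSlotPredictor()
    slots = torch.randn(2, 6, 64)   # 6 slots per scene
    text = torch.randn(2, 10, 64)   # 10 caption tokens
    next_slots = predictor(slots, text)
    print(next_slots.shape)         # torch.Size([2, 6, 64])
```

In a full model, predicted slots would be rolled out autoregressively and passed to a slot decoder to render future video frames; that machinery is omitted from this sketch.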