Abstract: Accurate and flexible world models are crucial for autonomous systems to understand their environment and predict future events. Object-centric models, with structured latent spaces, have shown promise in modeling object dynamics and interactions, but often face challenges in scaling to complex datasets and incorporating external guidance, limiting their applicability in robotics. To address these limitations, we propose TextOCVP, an object-centric model for image-to-video generation guided by textual descriptions. TextOCVP parses an observed scene into object representations, called slots, and utilizes a text-conditioned transformer predictor to forecast future object states and video frames. Our approach jointly models object dynamics and interactions while incorporating textual guidance, thus leading to accurate and controllable predictions. Our method's structured latent space offers enhanced control over the prediction process, outperforming several image-to-video generative baselines. Additionally, we demonstrate that structured object-centric representations provide superior controllability and interpretability, facilitating the modeling of object dynamics and enabling more precise and understandable predictions. Videos and code are available at https://play-slot.github.io/TextOCVP/.
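To make the described pipeline concrete, the following minimal sketch illustrates the general idea of a text-conditioned, object-centric predictor: slot states self-attend to model object interactions and cross-attend to caption token embeddings for textual guidance before a residual next-step prediction. All module names, layer choices, and dimensions here are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch only: a text-conditioned slot predictor in the spirit of
# the abstract. Names, dimensions, and layer choices are assumptions.
import torch
import torch.nn as nn


class TextConditionedSlotPredictor(nn.Module):
    """Predicts next-step slot states from current slots and text embeddings."""

    def __init__(self, slot_dim=64, text_dim=64, num_heads=4, num_layers=2):
        super().__init__()
        # Self-attention layers model slot-slot (object-object) interactions.
        self.self_attn_layers = nn.ModuleList(
            nn.TransformerEncoderLayer(
                d_model=slot_dim, nhead=num_heads, batch_first=True
            )
            for _ in range(num_layers)
        )
        # Cross-attention lets each slot attend to the caption's token embeddings.
        self.cross_attn = nn.MultiheadAttention(
            embed_dim=slot_dim, kdim=text_dim, vdim=text_dim,
            num_heads=num_heads, batch_first=True,
        )
        self.out = nn.Linear(slot_dim, slot_dim)

    def forward(self, slots, text_tokens):
        # slots:       (batch, num_slots, slot_dim)   current object states
        # text_tokens: (batch, num_tokens, text_dim)  encoded description
        x = slots
        for layer in self.self_attn_layers:
            x = layer(x)                        # object dynamics and interactions
        attended, _ = self.cross_attn(x, text_tokens, text_tokens)
        x = x + attended                        # inject textual guidance
        return slots + self.out(x)              # residual next-step slot prediction


if __name__ == "__main__":
    predictor = TextConditionedSlotPredictor()
    slots = torch.randn(2, 6, 64)   # 6 slots per scene
    text = torch.randn(2, 10, 64)   # 10 caption tokens
    next_slots = predictor(slots, text)
    print(next_slots.shape)         # torch.Size([2, 6, 64])
```

In a full model, predicted slots would be rolled out autoregressively and passed to a slot decoder to render future video frames; that machinery is omitted from this sketch.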