PlaySlot: Learning Inverse Latent Dynamics for Controllable Object-Centric Video Prediction and Planning
Abstract: Predicting future scene representations is a crucial task for enabling robots to understand and interact with the environment. However, most existing methods rely on video sequences and simulations with precise action annotations, limiting their ability to leverage the large amount of available unlabeled video data. To address this challenge, we propose PlaySlot, an object-centric video prediction model that infers object representations and latent actions from unlabeled video sequences. It then uses these representations to forecast future object states and video frames. PlaySlot can generate multiple possible futures conditioned on latent actions, which can be inferred from video dynamics, provided by a user, or generated by a learned action policy, thus enabling versatile and interpretable world modeling. Our results show that PlaySlot outperforms both stochastic and object-centric baselines for video prediction across different environments. Furthermore, we show that our inferred latent actions can be used to learn robot behaviors sample-efficiently from unlabeled video demonstrations. Videos and code are available at https://play-slot.github.io/PlaySlot/.