PlaySlot: Learning Inverse Latent Dynamics for Controllable Object-Centric Video Prediction and Planning

Published: 30 Apr 2025 · Last Modified: 06 May 2025 · ICML 2025 · CC BY 4.0
Abstract: Predicting future scene representations is a crucial task for enabling robots to understand and interact with the environment. However, most existing methods rely on video sequences and simulations with precise action annotations, limiting their ability to leverage the large amount of available unlabeled video data. To address this challenge, we propose PlaySlot, an object-centric video prediction model that infers object representations and latent actions from unlabeled video sequences. It then uses these representations to forecast future object states and video frames. PlaySlot can generate multiple possible futures conditioned on latent actions, which can be inferred from video dynamics, provided by a user, or generated by a learned action policy, thus enabling versatile and interpretable world modeling. Our results show that PlaySlot outperforms both stochastic and object-centric baselines for video prediction across different environments. Furthermore, we show that our inferred latent actions can be used to learn robot behaviors sample-efficiently from unlabeled video demonstrations. Videos and code are available at https://play-slot.github.io/PlaySlot/.
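The abstract describes an inverse-dynamics module that infers latent actions from consecutive object-slot states, and a transition model that rolls out future slots conditioned on those actions (whether inferred from video, supplied by a user, or produced by a learned policy). The minimal PyTorch sketch below illustrates that conditioning interface only; all module names, dimensions, and architectures are illustrative assumptions, not PlaySlot's actual implementation.

```python
# Hypothetical sketch of the latent-action conditioning interface described in
# the abstract. Shapes, names, and architectures are assumptions for illustration.
import torch
import torch.nn as nn

NUM_SLOTS, SLOT_DIM, ACTION_DIM = 6, 64, 16

class InverseDynamics(nn.Module):
    """Infers a latent action from two consecutive sets of object slots."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * NUM_SLOTS * SLOT_DIM, 128), nn.ReLU(),
            nn.Linear(128, ACTION_DIM),
        )

    def forward(self, slots_t, slots_t1):
        x = torch.cat([slots_t.flatten(1), slots_t1.flatten(1)], dim=-1)
        return self.net(x)

class SlotTransition(nn.Module):
    """Predicts next-step slots conditioned on current slots and a latent action."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(SLOT_DIM + ACTION_DIM, 128), nn.ReLU(),
            nn.Linear(128, SLOT_DIM),
        )

    def forward(self, slots, action):
        # Broadcast the same latent action to every object slot.
        a = action.unsqueeze(1).expand(-1, slots.size(1), -1)
        return slots + self.net(torch.cat([slots, a], dim=-1))  # residual update

def rollout(transition, slots, actions):
    """Autoregressive rollout; the actions may come from the inverse-dynamics
    module, a user, or a learned policy -- the three modes named in the abstract."""
    trajectory = [slots]
    for a in actions:
        slots = transition(slots, a)
        trajectory.append(slots)
    return torch.stack(trajectory, dim=1)  # (batch, T+1, num_slots, slot_dim)

if __name__ == "__main__":
    inv_dyn, transition = InverseDynamics(), SlotTransition()
    slots_t = torch.randn(2, NUM_SLOTS, SLOT_DIM)
    slots_t1 = torch.randn(2, NUM_SLOTS, SLOT_DIM)
    latent_action = inv_dyn(slots_t, slots_t1)       # inferred from video dynamics
    future = rollout(transition, slots_t1, [latent_action] * 5)
    print(future.shape)  # torch.Size([2, 6, 6, 64])
```

In this toy setup, swapping the source of `latent_action` (inverse dynamics, user input, or a policy network) changes which of the multiple possible futures is rolled out, which is the controllability the abstract emphasizes.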