TL;DR: We introduce an object-centric model-based RL algorithm that learns solely from pixels, enhancing interpretability and outperforming state-of-the-art methods in robotic tasks requiring reasoning and manipulation.
Abstract: Learning a latent dynamics model provides a task-agnostic representation of an agent's understanding of its environment. Leveraging this knowledge for model-based reinforcement learning (RL) holds the potential to improve sample efficiency over model-free methods by learning from imagined rollouts. Furthermore, because the latent space serves as input to behavior models, the informative representations learned by the world model facilitate efficient learning of desired skills. Most existing methods rely on holistic representations of the environment’s state. In contrast, humans reason about objects and their interactions, predicting how actions will affect specific parts of their surroundings. Inspired by this, we propose *Slot-Attention for Object-centric Latent Dynamics (SOLD)*, a novel model-based RL algorithm that learns object-centric dynamics models in an unsupervised manner from pixel inputs. We demonstrate that the structured latent space not only improves model interpretability but also provides a valuable input space for behavior models to reason over. Our results show that SOLD outperforms DreamerV3 and TD-MPC2, both state-of-the-art model-based RL algorithms, across a range of multi-object manipulation environments that require both relational reasoning and dexterous control. Videos and code are available at https://slot-latent-dynamics.github.io.
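To make the object-centric bottleneck concrete, below is a minimal PyTorch sketch of Slot Attention (Locatello et al., 2020), the grouping mechanism referenced in the method's name: a fixed number of slot vectors compete via attention to explain per-pixel features, yielding one compact representation per object. This is an illustrative re-implementation under assumed sizes, not the authors' released code (which is linked above), and it omits the MLP residual of the original for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlotAttention(nn.Module):
    """Minimal Slot Attention sketch: K slots compete to explain N input features."""
    def __init__(self, num_slots: int = 7, dim: int = 64, iters: int = 3):
        super().__init__()
        self.num_slots, self.iters, self.scale = num_slots, iters, dim ** -0.5
        # Learned Gaussian parameters from which initial slots are sampled.
        self.slots_mu = nn.Parameter(torch.randn(1, 1, dim))
        self.slots_sigma = nn.Parameter(torch.rand(1, 1, dim))
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.gru = nn.GRUCell(dim, dim)
        self.norm_input = nn.LayerNorm(dim)
        self.norm_slots = nn.LayerNorm(dim)

    def forward(self, inputs: torch.Tensor) -> torch.Tensor:
        # inputs: (B, N, dim) flattened per-pixel features from a CNN encoder.
        B = inputs.shape[0]
        inputs = self.norm_input(inputs)
        k, v = self.to_k(inputs), self.to_v(inputs)
        slots = self.slots_mu + self.slots_sigma.abs() * torch.randn(
            B, self.num_slots, inputs.shape[-1], device=inputs.device)
        for _ in range(self.iters):
            q = self.to_q(self.norm_slots(slots))
            # Softmax over the slot axis: slots compete for each input location.
            attn = F.softmax(torch.einsum('bkd,bnd->bkn', q, k) * self.scale, dim=1)
            attn = attn / (attn.sum(dim=-1, keepdim=True) + 1e-8)  # weighted mean
            updates = torch.einsum('bkn,bnd->bkd', attn, v)
            slots = self.gru(
                updates.reshape(-1, updates.shape[-1]),
                slots.reshape(-1, slots.shape[-1]),
            ).reshape(B, self.num_slots, -1)
        return slots  # (B, K, dim): one latent vector per discovered object
```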
Lay Summary: Teaching robots and game-playing agents is often time-consuming because most algorithms take in every pixel instead of the handful of objects that really matter. Humans, by contrast, effortlessly track the coffee mug, the table, and their own hands, and predict how each will move when they act. Our work aims to bring that object-level common sense to machines. We built SOLD, a system that receives video sequences and, with no human labels, splits the scene into individual “slots”: one compact representation per object. It then learns how each slot changes over time, letting the agent imagine how the scene will evolve under different actions. Because the agent reasons in terms of objects, its inner workings are easier for people to inspect. In simulated tasks where a robot must reason over multiple objects in a scene and manipulate a specific one, SOLD masters the required skills faster and more reliably than today’s best methods.
This efficiency could help cut training costs for real-world robots and make them more interpretable, since we can see which objects they attend to when selecting an action.
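To illustrate the "imagining" step the summary describes, here is a hedged sketch of rolling object slots forward through a learned dynamics model without touching the environment. The `SlotDynamics` module, its transformer architecture, and all shapes are illustrative assumptions, not SOLD's actual implementation.

```python
import torch
import torch.nn as nn

class SlotDynamics(nn.Module):
    """Hypothetical dynamics model: predicts next-step slots from slots + action."""
    def __init__(self, dim: int = 64, action_dim: int = 4, heads: int = 4, layers: int = 2):
        super().__init__()
        self.action_proj = nn.Linear(action_dim, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, layers)
        self.out = nn.Linear(dim, dim)

    def forward(self, slots: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # slots: (B, K, dim), action: (B, action_dim).
        # Append the action as an extra token so every slot can attend to it.
        tokens = torch.cat([slots, self.action_proj(action).unsqueeze(1)], dim=1)
        h = self.transformer(tokens)
        return slots + self.out(h[:, : slots.shape[1]])  # residual slot update

def imagine(slots: torch.Tensor, actions: list[torch.Tensor],
            dynamics: SlotDynamics) -> torch.Tensor:
    """Roll the object slots forward through a candidate action sequence."""
    trajectory = [slots]
    for action in actions:
        trajectory.append(dynamics(trajectory[-1], action))
    return torch.stack(trajectory)  # (T+1, B, K, dim) imagined slot trajectory
```

Because each slot stays tied to one object across the rollout, the imagined trajectory can be inspected per object, which is the interpretability benefit mentioned above.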
Link To Code: https://slot-latent-dynamics.github.io
Primary Area: Reinforcement Learning->Deep RL
Keywords: Model-based Reinforcement Learning, Object-centric Learning, World Models, Robotics
Submission Number: 6390