Track: Full track
Keywords: self-supervised learning, world models, object-centric
Abstract: The ability to predict the outcomes of interactions between embodied agents and objects is paramount in robotics. While model-based control methods have begun to be employed for manipulation tasks, they struggle to position objects accurately. Analyzing this limitation, we trace the underperformance to the way current world models represent crucial positional information, especially the goal specification for object positioning tasks.
We propose two solutions for generative world models: position-conditioned policy learning (PCP) and latent-conditioned policy learning (LCP). In particular, LCP employs object-centric latent representations that explicitly capture positional information for goal specification, which naturally gives rise to multimodal goal-specification capabilities.
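As a rough illustration of the two conditioning schemes described above, here is a minimal sketch (all module names, dimensions, and the `ObjectEncoder` are illustrative assumptions, not the submission's code): PCP conditions the policy directly on a target position, while LCP conditions it on an object-centric latent inferred from a goal observation.

```python
# Minimal sketch of the two conditioning schemes (all names and
# dimensions are illustrative assumptions, not the submission's code).
import torch
import torch.nn as nn

class Policy(nn.Module):
    """Maps a world-model state plus a goal embedding to an action."""
    def __init__(self, state_dim: int, goal_dim: int, action_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, state: torch.Tensor, goal: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, goal], dim=-1))

# PCP: condition the policy directly on the target object position.
pcp_policy = Policy(state_dim=128, goal_dim=3, action_dim=7)
state = torch.randn(1, 128)           # latent state from the world model
goal_position = torch.randn(1, 3)     # desired (x, y, z) of the object
action = pcp_policy(state, goal_position)

# LCP: condition on an object-centric latent that encodes positional
# information; goals can then be supplied in different modalities
# (e.g., an image or coordinates) as long as they map to this latent.
class ObjectEncoder(nn.Module):
    """Hypothetical encoder producing an object-centric goal latent."""
    def __init__(self, obs_dim: int, latent_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, goal_obs: torch.Tensor) -> torch.Tensor:
        return self.net(goal_obs)

encoder = ObjectEncoder(obs_dim=64, latent_dim=32)
lcp_policy = Policy(state_dim=128, goal_dim=32, action_dim=7)
goal_obs = torch.randn(1, 64)         # goal observation (flattened here)
action = lcp_policy(state, encoder(goal_obs))
```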
Submission Number: 7