InvSlotGNN: Unsupervised Discovery of Viewpoint Invariant Multiobject Representations and Visual Dynamics

Published: 01 Jan 2025 · Last Modified: 18 Sept 2025 · IEEE Trans. Robotics 2025 · CC BY-SA 4.0
Abstract: Learning multiobject dynamics purely from visual data is challenging because it requires robust object representations that can be learned through robot interactions. In previous work (Rezazadeh et al., 2023), we introduced two novel architectures: SlotTransport, which discovers object-centric representations, referred to as slots, from single-view RGB images, and SlotGNN, which predicts scene dynamics from single-view RGB images and robot interactions using the discovered slots. This article introduces InvSlotGNN, a novel framework for learning multiview slot discovery and dynamics that are invariant to the camera viewpoint. First, we demonstrate that SlotTransport can be trained on multiview data such that a single model discovers temporally aligned, object-centric representations across a wide range of camera angles. These slots bind to objects from various viewpoints, even under occlusion or when objects are absent. Next, we introduce InvSlotGNN, an extension of SlotGNN that learns multiobject dynamics invariant to the camera angle and predicts the future state of the scene from observations taken by uncalibrated cameras. InvSlotGNN builds a graph representation of the scene from the slots discovered by SlotTransport and performs relational and spatial reasoning to predict the future state of the scene from arbitrary viewpoints, conditioned on robot actions. We demonstrate the effectiveness of SlotTransport in learning multiview object-centric features that accurately encode both visual and positional information. Furthermore, we highlight the accuracy of InvSlotGNN in downstream robotic tasks, including long-horizon prediction and multiobject rearrangement. Finally, with minimal real data, our framework robustly predicts slots and their dynamics in real-world multiview scenarios.
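The abstract describes InvSlotGNN as building a graph over the discovered slots and performing relational and spatial reasoning, conditioned on robot actions, to predict the scene's future state. The sketch below illustrates that general slot-graph dynamics pattern in PyTorch: slots are graph nodes, an edge MLP computes pairwise messages, and a node MLP combines aggregated messages with the robot action to predict each slot at the next time step. All names (SlotGraphDynamics, slot_dim, action_dim, the MLP shapes) are hypothetical illustrations under stated assumptions, not the authors' implementation.

```python
# Minimal sketch of slot-graph dynamics: a fully connected GNN over object
# slots, conditioned on a robot action. Hypothetical names and shapes.
import torch
import torch.nn as nn


class SlotGraphDynamics(nn.Module):
    """Each slot is a node; edges carry pairwise (relational) messages,
    and the node update predicts a per-slot delta for the next time step."""

    def __init__(self, slot_dim: int = 64, action_dim: int = 4, hidden: int = 128):
        super().__init__()
        # Edge network: message from sender slot j to receiver slot i.
        self.edge_mlp = nn.Sequential(
            nn.Linear(2 * slot_dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden)
        )
        # Node network: own state + aggregated messages + action -> slot delta.
        self.node_mlp = nn.Sequential(
            nn.Linear(slot_dim + hidden + action_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, slot_dim),
        )

    def forward(self, slots: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # slots: (B, K, D) object-centric representations; action: (B, A).
        B, K, D = slots.shape
        dst = slots.unsqueeze(2).expand(B, K, K, D)  # receiver i
        src = slots.unsqueeze(1).expand(B, K, K, D)  # sender j
        messages = self.edge_mlp(torch.cat([dst, src], dim=-1))  # (B, K, K, H)
        agg = messages.sum(dim=2)                    # sum over senders -> (B, K, H)
        act = action.unsqueeze(1).expand(B, K, -1)   # broadcast action to all slots
        delta = self.node_mlp(torch.cat([slots, agg, act], dim=-1))
        return slots + delta                         # predicted slots at t+1


# Usage: roll the model forward for long-horizon prediction.
model = SlotGraphDynamics()
slots = torch.randn(2, 5, 64)    # batch of 2 scenes, 5 slots each
actions = torch.randn(2, 3, 4)   # 3-step action sequence
for t in range(actions.shape[1]):
    slots = model(slots, actions[:, t])
```

Summing messages over senders makes the update permutation-invariant with respect to slot ordering, which is the property that lets a graph over unordered object-centric slots generalize across scenes with different object arrangements.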