Keywords: object-centric learning, dynamics prediction, cross-attention, transformer
TL;DR: Unsupervised dynamics prediction with object-centric representations via cross-attention between slots and image patches.
Abstract: Human perception involves discerning objects based on attributes such as size, color, and texture, and predicting their movements from features such as weight and speed. This innate ability operates without conscious learning, allowing individuals to perform actions such as catching or avoiding objects without deliberate awareness. Accordingly, a fundamental key to higher-level cognition lies in the ability to decompose intricate multi-object scenes into meaningful object-level components. Object-centric representations have emerged as a promising tool for such scene decomposition by providing useful abstractions. In this paper, we propose a novel approach to unsupervised video prediction that leverages object-centric representations. Our method introduces a two-component model consisting of a slot encoder for object-centric disentanglement and a feature extraction module for masked patches. These components are integrated through a cross-attention mechanism, enabling comprehensive spatio-temporal reasoning. Our model performs better on intricate scenes characterized by a wide range of object attributes and dynamic movements. Moreover, our approach scales across diverse synthetic environments, showcasing its potential for broad use in vision-related tasks.
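The abstract describes integrating slot representations and masked patch features via cross-attention, with slots attending to patch features. The sketch below illustrates one plausible form of that integration; the class name, dimensions, and residual/MLP structure are assumptions for illustration, not the authors' actual implementation.

```python
import torch
import torch.nn as nn


class SlotPatchCrossAttention(nn.Module):
    """Minimal sketch: slots (queries) attend over masked patch features (keys/values).

    All names and hyperparameters here are illustrative assumptions, not the
    paper's reported architecture.
    """

    def __init__(self, slot_dim=128, patch_dim=128, num_heads=4):
        super().__init__()
        self.norm_slots = nn.LayerNorm(slot_dim)
        self.norm_patches = nn.LayerNorm(patch_dim)
        # Cross-attention: slots query the patch features.
        self.attn = nn.MultiheadAttention(
            embed_dim=slot_dim, num_heads=num_heads,
            kdim=patch_dim, vdim=patch_dim, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(slot_dim, 4 * slot_dim),
            nn.GELU(),
            nn.Linear(4 * slot_dim, slot_dim))

    def forward(self, slots, patches):
        # slots:   (B, num_slots, slot_dim)    from the slot encoder
        # patches: (B, num_patches, patch_dim) from the masked-patch extractor
        q = self.norm_slots(slots)
        kv = self.norm_patches(patches)
        attended, _ = self.attn(q, kv, kv)
        slots = slots + attended           # residual update of the slots
        slots = slots + self.mlp(slots)    # feed-forward refinement
        return slots


if __name__ == "__main__":
    model = SlotPatchCrossAttention()
    slots = torch.randn(2, 6, 128)       # e.g. 6 slots per frame
    patches = torch.randn(2, 196, 128)    # e.g. a 14x14 patch grid
    print(model(slots, patches).shape)    # torch.Size([2, 6, 128])
```

In this reading, the updated slots carry object-level information aggregated from the patch features and would feed a downstream dynamics/prediction module; the specific stacking and temporal handling are left to the paper itself.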
Submission Number: 23