Compositional Visual Reasoning with SlotSSMs

Published: 10 Oct 2024, Last Modified: 25 Dec 2024
Venue: NeurIPS'24 Compositional Learning Workshop (Poster)
License: CC BY 4.0
Keywords: State Space Models, Mamba, Video Models, Visual Reasoning, Slot, Object-Centric Learning
TL;DR: We propose SlotSSMs, a novel framework for incorporating independent mechanisms into State Space Models (SSMs), such as Mamba, to preserve or encourage separation of information, thereby improving visual reasoning.
Abstract: In many real-world sequence modeling problems, the underlying process is inherently modular, and it is important to design machine learning architectures that can leverage this modular structure. In this paper, we introduce SlotSSMs, a novel framework for incorporating independent mechanisms into State Space Models (SSMs), such as Mamba, to preserve or encourage separation of information, thereby improving visual reasoning. We evaluate SlotSSMs on long-sequence reasoning and real-world depth estimation tasks, demonstrating substantial performance improvements over existing sequence modeling methods. Our design efficiently exploits the modularity of inputs and scales effectively through the parallelizable architecture enabled by SSMs. We hope this approach will inspire future research on compositional reasoning architectures.
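
The abstract describes the idea only at a high level, so the following is a minimal, hypothetical PyTorch sketch of what a SlotSSM-style block could look like: each slot carries its own state, updated independently by a shared diagonal linear SSM (a simple stand-in for Mamba), and slots interact only through self-attention over the slot axis. The class name `SlotSSMBlock`, the diagonal-SSM parameterization, and the explicit time loop (the paper relies on a parallelizable scan) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a SlotSSM-style layer: independent per-slot SSM
# updates plus sparse cross-slot interaction. Not the paper's actual code.
import torch
import torch.nn as nn

class SlotSSMBlock(nn.Module):
    def __init__(self, dim: int, num_slots: int, heads: int = 4):
        super().__init__()
        # Diagonal linear SSM parameters, shared across slots; each slot
        # still keeps its own independent hidden state.
        self.log_decay = nn.Parameter(torch.zeros(dim))
        self.in_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)
        # Slots exchange information only here, via attention over slots.
        self.mix = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, inputs: torch.Tensor) -> torch.Tensor:
        # inputs: (batch, time, num_slots, dim) per-slot input tokens.
        B, T, S, D = inputs.shape
        decay = torch.sigmoid(self.log_decay)            # per-dim decay in (0, 1)
        state = inputs.new_zeros(B, S, D)                # independent slot states
        outputs = []
        for t in range(T):                               # naive recurrence for clarity
            u = self.in_proj(inputs[:, t])               # (B, S, D)
            state = decay * state + (1.0 - decay) * u    # slot-wise SSM update
            y = self.out_proj(state)
            y_norm = self.norm(y)
            mixed, _ = self.mix(y_norm, y_norm, y_norm)  # cross-slot interaction
            outputs.append(y + mixed)
        return torch.stack(outputs, dim=1)               # (B, T, S, D)

# Usage: a batch of 2 sequences, 10 frames, 6 slots of width 64.
x = torch.randn(2, 10, 6, 64)
y = SlotSSMBlock(dim=64, num_slots=6)(x)
print(y.shape)  # torch.Size([2, 10, 6, 64])
```

In this sketch, the modularity claim corresponds to the state update touching each slot independently, while the attention step is the only place information can flow between slots.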
Submission Number: 11