Faster Slot Decoding using Masked Transformer

Published: 10 Oct 2024, Last Modified: 25 Dec 2024
Venue: NeurIPS'24 Compositional Learning Workshop Poster
License: CC BY 4.0
Keywords: Compositional Representation, Object-Centric Learning, Masked Token Prediction, Image Transformers
TL;DR: We propose a new slot decoder architecture using masked bidirectional transformers.
Abstract: Object-centric learning models commonly learn a set of representations, called "slots". Recent advances in object-centric learning have introduced autoregressive decoders that decode slots into features or images, allowing models to learn compositional representations from more complex and realistic datasets. However, autoregressive decoding is slow because it is inherently sequential, which makes it difficult to apply to downstream tasks such as video generation. In this paper, we introduce MaskSDT, a masked bidirectional transformer that decodes all slots simultaneously. Our experiments on the 3D Shapes and CLEVR datasets show that our model improves reconstruction performance and generation speed while achieving comparable results in compositional generation.
Submission Number: 31