Keywords: Representation Learning, Disentanglement, Object-Centric Learning, Transformers, Compositionality
Abstract: Learning disentangled representations of objects in an image is a prerequisite for the robust compositional generalization characteristic of human intelligence. While progress has been made in object-centric representation learning (OCRL), existing methods rely on strong architectural priors that hinder scalability. In this work, we explore a more scalable approach to OCRL. Namely, we propose to use a general-purpose architecture and add inductive biases to the model via additional regularizers. To formulate suitable regularizers, we take inspiration from recent theoretical results that put forth two properties a model should satisfy to provably disentangle objects. We show that these properties can be scalably enforced using a VAE loss and a novel loss on the attention weights of a Transformer. We incorporate these regularizers into a general-purpose Transformer autoencoder and attain performance competitive with, and often superior to, existing OCRL methods that rely on stronger architectural priors.
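To make the training objective concrete, below is a minimal sketch of how the two regularizers described in the abstract might be combined with a reconstruction loss. This is an illustrative assumption, not the paper's exact formulation: the function names (`vae_kl_loss`, `attention_entropy_loss`, `total_loss`), the weights `beta` and `gamma`, and the specific entropy-based form of the attention penalty are all hypothetical placeholders for the VAE loss and the attention-weight loss the abstract refers to.

```python
import torch
import torch.nn.functional as F

def vae_kl_loss(mu, logvar):
    # Standard VAE regularizer: KL divergence between the approximate
    # posterior N(mu, exp(logvar)) and a unit Gaussian prior over the latents.
    return -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())

def attention_entropy_loss(attn):
    # attn: (batch, n_tokens, n_latents) cross-attention weights of the
    # Transformer, assumed to sum to 1 over the latent axis.
    # Penalizing the entropy of each token's attention distribution pushes
    # every token to rely on only a few latents; this is one possible
    # (assumed) instantiation of a loss on the attention weights.
    eps = 1e-8
    return (-(attn * (attn + eps).log()).sum(dim=-1)).mean()

def total_loss(x, x_recon, mu, logvar, attn, beta=1.0, gamma=0.1):
    # Reconstruction term of the Transformer autoencoder plus the two
    # regularizers, with hypothetical weights beta and gamma.
    recon = F.mse_loss(x_recon, x)
    return recon + beta * vae_kl_loss(mu, logvar) + gamma * attention_entropy_loss(attn)
```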
Submission Number: 8