Are Object-Centric Representations Better At Compositional Generalization?

ICLR 2026 Conference Submission 20682 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Compositional generalization, object-centric learning, visual question answering
TL;DR: We systematically study the compositional generalization capabilities of object-centric representations on a visual question answering (VQA) downstream task, comparing them to standard visual encoders.
Abstract: Compositional generalization -- the ability to reason about novel combinations of familiar concepts -- is fundamental to human cognition and a critical challenge for machine learning. Object-centric (OC) learning, which represents a scene as a set of objects, has been proposed as a promising approach to achieving this capability. However, systematic evaluation of these methods in visually complex settings remains limited. In this work, we introduce a Visual Question Answering benchmark spanning three different visual worlds to measure how well vision encoders, with and without object-centric biases, generalize to unseen combinations of object properties. To ensure a fair and comprehensive comparison, we carefully control for the capacity of the image representation, training data diversity, downstream compute, and sample size. In this study, we use two widely used vision encoders, DINOv2 and SigLIP2, as foundation models, along with their object-centric counterparts. Our key findings reveal that (1) object-centric approaches are superior in harder compositional generalization settings; (2) the original dense representations surpass OC representations only in easier settings and typically require substantially more downstream compute; and (3) OC models are more sample-efficient, achieving stronger generalization with fewer images, whereas dense encoders catch up to or surpass them only with sufficient data and diversity. Overall, object-centric representations offer stronger compositional generalization when any one of training data diversity, sample size, or downstream compute is constrained.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 20682