Keywords: Object-centric learning, grounding, embodied reasoning, visual tokenizers
TL;DR: We propose a grounding method for object-centric representations.
Abstract: Several accounts of human cognition posit that our intelligence is rooted in our ability to form abstract, composable
concepts, ground them in our environment, and reason over these grounded entities. This trifecta of human
thought has remained elusive in modern intelligent machines. In this work, we investigate whether slot representations
extracted from visual scenes serve as appropriate compositional abstractions for grounding and reasoning. We present the
Neural Slot Interpreter (NSI), which learns to ground object semantics in slots. At the core of NSI is an XML-like
schema that uses simple syntax rules to organize the object semantics of a scene into object-centric schema primitives.
The NSI metric then learns to ground these primitives in slots through a structured objective that reasons over their intermodal
alignment. We show that the grounded slots surpass unsupervised slots in real-world object discovery and scale with scene
complexity. Experiments with a bi-modal object-property and scene retrieval task demonstrate the grounding efficacy and
interpretability of correspondences learned by NSI. Finally, we investigate the reasoning abilities of the grounded slots.
Vision Transformers trained with grounding-aware NSI tokenizers, using as few as ten tokens, outperform those trained on patch-based tokens on
challenging few-shot classification tasks.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 1654