Language Elicits Emergent Symbol Processing in Vision Foundation Models
Keywords: compositional reasoning, symbolic processing, visual reasoning, mechanistic interpretability, vision-language models, vision encoders
TL;DR: We show that language-conditioned vision encoders develop predicate-aligned geometric structures for symbol processing. We introduce predicates as labels to probe symbol processing inside the model.
Abstract: Compositional reasoning requires a model to combine linguistic constituents and visual entities into task-relevant intermediate structures.
A common assumption holds that reasoning resides in language models, while vision foundation models (VFMs) are treated as feature extractors.
We ask whether symbolic mechanisms can emerge inside the visual stream.
We train a language-conditioned vision transformer and introduce predicates as labels to measure whether layer-wise visual geometry tracks question-derived semantic conditions.
We find that the visual stream develops a three-stage mechanism: (1) feature binding, which segregates scenes containing the referenced entity at the middle layer; (2) object grounding, which resolves the binding cluster into object-level substructure at a later layer; and (3) answer matching, which repartitions representations into answer-aligned groups at the final layer.
Representational similarity analysis (RSA) and activation patching support this hierarchy, suggesting that language-conditioned vision encoders can instantiate symbol-like processing as query-dependent geometric organization.
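As an illustration of the kind of measurement described above, the following is a minimal sketch of an RSA score between one layer's activations and a predicate labeling. It is not the authors' pipeline: the choice of cosine distance, Spearman correlation, and a binary same/different label matrix are assumptions, the function names (`rsa_score`, `upper_triangle_rdm`, and so on) are hypothetical, and the activations are synthetic. It uses only the standard library so that it stays self-contained.

```python
import math
import random


def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def upper_triangle_rdm(items, dist):
    """Flattened upper triangle of a representational dissimilarity matrix."""
    n = len(items)
    return [dist(items[i], items[j]) for i in range(n) for j in range(i + 1, n)]


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation computed as Pearson correlation of ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def rsa_score(activations, labels):
    """Correlate a layer's geometry with a predicate-label partition.

    activations: one feature vector per sample (assumed layer output).
    labels: one predicate label per sample (assumed probe target).
    """
    neural_rdm = upper_triangle_rdm(activations, cosine_distance)
    # Model RDM: 0 if two samples share the predicate label, else 1.
    model_rdm = upper_triangle_rdm(labels, lambda a, b: 0.0 if a == b else 1.0)
    return spearman(neural_rdm, model_rdm)


def synthetic_layer(labels, centers, noise, rng):
    """Synthetic activations: a per-label center plus Gaussian noise."""
    return [[c + rng.gauss(0.0, noise) for c in centers[lab]] for lab in labels]


rng = random.Random(0)
labels = [0] * 10 + [1] * 10
aligned_centers = {0: [1.0, 0.0, 0.0, 0.0], 1: [0.0, 1.0, 0.0, 0.0]}
shared_centers = {0: [0.0, 0.0, 0.0, 0.0], 1: [0.0, 0.0, 0.0, 0.0]}

# A layer whose geometry follows the predicate, and one that ignores it.
aligned = synthetic_layer(labels, aligned_centers, 0.05, rng)
unaligned = synthetic_layer(labels, shared_centers, 1.0, rng)

print("aligned layer RSA:  ", rsa_score(aligned, labels))
print("unaligned layer RSA:", rsa_score(unaligned, labels))
```

Under these assumptions, a layer whose representations cluster by the predicate should score clearly higher than one whose representations do not. Repeating the computation per layer and per predicate type (for example, entity presence, object identity, answer) would give the kind of layer-wise profile the abstract describes.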
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 174