Emergent Symbol Grounding in Language Models

Published: 23 Sept 2025, Last Modified: 23 Sept 2025. CogInterp @ NeurIPS 2025 Poster. License: CC BY 4.0
Keywords: Language Grounding, Mechanistic Interpretability, Language Models
Abstract: Do autoregressive LMs acquire symbol grounding in Harnad's (1990) sense, that is, non-arbitrary, causally useful links between symbols and referents? We introduce a controlled evaluation framework that assigns each concept two distinct tokens: one appearing in non-verbal scene descriptions and another in linguistic utterances. This separation prevents trivial identity mappings and enables direct tests of grounding. Behaviorally, models trained from scratch show consistent surprisal reduction when the linguistic form is preceded by its matching scene token, relative to matched controls, and this effect cannot be explained by co-occurrence statistics. Mechanistically, saliency flow and tuned-lens analyses converge on the finding that grounding concentrates in middle-layer computations and is implemented through the gather-and-aggregate (G&A) mechanism: earlier heads gather information from scene tokens, while later heads aggregate it to support the prediction of linguistic forms. The phenomenon replicates in multimodal dialogue and across architectures (Transformers and State-Space Models), but not in unidirectional LSTMs. Together, these results provide behavioral and mechanistic evidence that symbol grounding can emerge in autoregressive LMs, while delineating the architectural conditions under which it arises.
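To make the behavioral test concrete, the sketch below illustrates the kind of paired-token surprisal contrast the abstract describes: comparing the surprisal of a linguistic form when preceded by its matching scene token versus a mismatched control. This is not the paper's code; it uses GPT-2 as a stand-in model and hypothetical token strings (<SCENE_APPLE>, <SCENE_DOG>) and a hypothetical helper surprisal_of_last_token, whereas the paper trains models from scratch with dedicated scene and linguistic vocabulary items.

```python
# Illustrative sketch only (not the paper's framework): measures the surprisal
# reduction of a linguistic form when its matching scene token precedes it,
# relative to a mismatched control, using GPT-2 as a stand-in model.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def surprisal_of_last_token(context: str, target: str) -> float:
    """Surprisal (in nats) of the first token of `target` given `context`."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    tgt_ids = tokenizer(" " + target, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ctx_ids).logits          # (1, seq_len, vocab)
    log_probs = torch.log_softmax(logits[0, -1], dim=-1)
    return -log_probs[tgt_ids[0, 0]].item()

# Hypothetical paired tokens: <SCENE_APPLE> is the scene form, "apple" the linguistic form.
# In the paper's setup these would be distinct vocabulary items in a model trained from scratch;
# here GPT-2 simply tokenizes them into subwords, which suffices for illustration.
matched    = "<SCENE_APPLE> The speaker says:"
mismatched = "<SCENE_DOG> The speaker says:"
delta = surprisal_of_last_token(mismatched, "apple") - surprisal_of_last_token(matched, "apple")
print(f"Grounding effect (surprisal reduction): {delta:.3f} nats")
```

A positive delta, averaged over many concept pairs and controlled against co-occurrence statistics, would correspond to the grounding effect reported behaviorally in the abstract.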
Submission Number: 107