Keywords: Developmental interpretability, Understanding high-level properties of models, Foundational work
Other Keywords: Superposition
TL;DR: The emergence and stabilization of superposition across training
Abstract: Polysemanticity—neurons activating for seemingly unrelated features—has long been viewed as a key obstacle for interpretable AI. We show instead that it follows a structured, hierarchical developmental trajectory, offering a principled perspective on how networks allocate scarce representational capacity. We present three interdependent analyses of Pythia 70M–2.8B across training checkpoints: clustering of top-activating excerpts, Jensen–Shannon divergence over frequency buckets, and a geometric characterization (polytope density and participation ratio). First, we trace representational dynamics over training: early layers encode token- and frequency-specific signals, with high- and low-frequency $n$-grams occupying distinct regions of activation space that mostly re-converge over training; deeper layers—and larger models—progressively shift toward representations that are invariant to token frequency and organized by semantic content. Second, we identify a coverage principle: neuron coverage (the fraction of positions in which a neuron participates), not raw frequency preference, predicts specialization. High-coverage neurons specialize, while low-coverage neurons remain generalists. Third, we observe that activation manifolds transition from fragmented to consolidated. Together, these results recast polysemanticity not as a static nuisance, but as a structured, evolutionary process that distributes scarce capacity efficiently and abstracts towards meaning.
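To make the two quantitative measures in the abstract concrete, here is a minimal sketch of (a) Jensen–Shannon divergence between a neuron's activation histograms on high- versus low-frequency tokens and (b) the participation ratio as an effective dimensionality of an activation manifold. The function names, the two-bucket split, and the histogram binning are illustrative assumptions, not the paper's exact pipeline.

```python
# Hedged sketch: JS divergence over (assumed) frequency buckets and the
# participation ratio of an activation matrix. Bucketing and names are
# hypothetical; only the two formulas follow standard definitions.
import numpy as np
from scipy.spatial.distance import jensenshannon

def frequency_bucket_jsd(acts_high, acts_low, n_bins=50):
    """JS divergence between one neuron's activation histograms on
    high-frequency vs. low-frequency tokens (assumed bucketing)."""
    lo = min(acts_high.min(), acts_low.min())
    hi = max(acts_high.max(), acts_low.max())
    p, _ = np.histogram(acts_high, bins=n_bins, range=(lo, hi), density=True)
    q, _ = np.histogram(acts_low, bins=n_bins, range=(lo, hi), density=True)
    # scipy returns the JS *distance* (sqrt of the divergence); square it.
    return jensenshannon(p + 1e-12, q + 1e-12) ** 2

def participation_ratio(acts):
    """PR = (sum_i lambda_i)^2 / sum_i lambda_i^2 over covariance
    eigenvalues: an effective dimensionality of the activations."""
    acts = acts - acts.mean(axis=0, keepdims=True)
    eig = np.linalg.eigvalsh(np.cov(acts, rowvar=False))
    eig = np.clip(eig, 0.0, None)
    return eig.sum() ** 2 / (np.square(eig).sum() + 1e-12)

# Toy usage: random activations standing in for one checkpoint's MLP outputs.
rng = np.random.default_rng(0)
acts = rng.normal(size=(2048, 512))  # (positions, neurons)
print(participation_ratio(acts))
print(frequency_bucket_jsd(acts[:1024, 0], acts[1024:, 0]))
```

Under this reading, a high JSD between frequency buckets would indicate frequency-specific structure (as reported for early layers), while a rising participation ratio would correspond to the consolidation of fragmented activation manifolds.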
Submission Number: 253