Semantic Convergence: Investigating Shared Representations Across Scaled LLMs

Published: 22 Jun 2025, Last Modified: 22 Jun 2025. Venue: ACL-SRW 2025 Poster. License: CC BY 4.0
Keywords: Sparse Autoencoders, Mechanistic Interpretability, Semantic Subspaces, Features, Large Language Models
TL;DR: Investigating how Large Language Models of varying size internally structure features when given complex data.
Abstract: We investigate feature universality in Gemma-2 language models (Gemma-2-2B and Gemma-2-9B), asking whether models with a fourfold difference in scale still converge on comparable internal concepts. Following the sparse autoencoder (SAE) dictionary learning pipeline, we applied pretrained SAEs to each model's residual-stream activations, aligned the resulting monosemantic features via activation correlation, and compared the matched feature spaces with metrics such as SVCCA and RSA. Middle layers yield the strongest overlap, indicating that both models represent concepts most similarly there, while early and late layers show much less similarity. Preliminary experiments extending the analysis from single tokens to multi-token subspaces show that semantically similar subspaces tend to interact with the LLMs in similar ways. These results offer further evidence that large language models carve the world into broadly similar, interpretable features despite size differences, reinforcing universality as a foundation for cross-model interpretability.
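The abstract outlines a two-step comparison: match SAE features across the two models by activation correlation, then score the matched feature spaces with a representational similarity metric such as SVCCA. Below is a minimal sketch of that pipeline, not the authors' code; array shapes, variable names, and the random stand-in activations are illustrative assumptions.

```python
"""Hedged sketch of the described pipeline: correlation-based feature matching
followed by an SVCCA comparison of the matched SAE feature spaces."""
import numpy as np


def match_features_by_correlation(acts_a: np.ndarray, acts_b: np.ndarray) -> np.ndarray:
    """For each SAE feature of model A, find the model-B feature whose
    activations over a shared token set correlate most strongly.

    acts_a: (n_tokens, n_features_a) SAE feature activations for model A
    acts_b: (n_tokens, n_features_b) SAE feature activations for model B
    Returns an index array mapping each A feature to its best-matching B feature.
    """
    a = (acts_a - acts_a.mean(0)) / (acts_a.std(0) + 1e-8)
    b = (acts_b - acts_b.mean(0)) / (acts_b.std(0) + 1e-8)
    corr = a.T @ b / a.shape[0]      # (n_features_a, n_features_b) Pearson correlations
    return corr.argmax(axis=1)


def svcca(x: np.ndarray, y: np.ndarray, keep: int = 20) -> float:
    """SVCCA similarity: SVD-reduce each space, then average the canonical correlations."""
    def svd_reduce(m: np.ndarray) -> np.ndarray:
        m = m - m.mean(0)
        u, s, _ = np.linalg.svd(m, full_matrices=False)
        return u[:, :keep] * s[:keep]

    x_r, y_r = svd_reduce(x), svd_reduce(y)
    # Canonical correlations are the singular values of Qx^T Qy after QR orthonormalization.
    qx, _ = np.linalg.qr(x_r)
    qy, _ = np.linalg.qr(y_r)
    rho = np.linalg.svd(qx.T @ qy, compute_uv=False)
    return float(rho.mean())


# Usage with random stand-ins for real SAE activations (hypothetical sizes).
rng = np.random.default_rng(0)
acts_2b = rng.random((2000, 256))    # stand-in for Gemma-2-2B SAE activations
acts_9b = rng.random((2000, 320))    # stand-in for Gemma-2-9B SAE activations
matching = match_features_by_correlation(acts_2b, acts_9b)
score = svcca(acts_2b, acts_9b[:, matching])
print(f"SVCCA similarity of matched feature spaces: {score:.3f}")
```

In practice the activations would come from pretrained residual-stream SAEs run over a shared corpus, and the comparison would be repeated per layer to produce the layerwise similarity pattern described in the abstract.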
Archival Status: Non‑archival
Paper Length: Long Paper (up to 8 pages of content)
Submission Number: 313