Semantic Convergence: Investigating Shared Representations Across Scaled LLMs

Published: 22 Jun 2025, Last Modified: 22 Jun 2025. Venue: ACL-SRW 2025 Poster. License: CC BY 4.0
Keywords: Sparse Autoencoders, Mechanistic Interpretability, Semantic Subspaces, Features, Large Language Models
TL;DR: Investigating how Large Language Models of varying size internally structure features when given complex data.
Abstract: We investigate feature universality in Gemma-2 language models (Gemma-2-2B and Gemma-2-9B), asking whether models with a fourfold difference in scale still converge on comparable internal concepts. Following the sparse autoencoder (SAE) dictionary learning pipeline, we applied pretrained SAEs to each model's residual-stream activations, aligned the resulting monosemantic features via activation correlation, and compared the matched feature spaces with metrics such as SVCCA and RSA. Middle layers yield the strongest overlap, indicating that both models represent concepts most similarly there, while early and late layers show much less similarity. Preliminary experiments extending the analysis from single tokens to multi-token subspaces show that semantically similar subspaces tend to interact with the LLMs in similar ways. These results offer further evidence that large language models carve the world into broadly similar, interpretable features despite size differences, reinforcing universality as a foundation for cross-model interpretability.
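The abstract outlines a two-step comparison: match SAE features across the two models by activation correlation, then score the matched feature spaces with a representational similarity metric such as SVCCA. Below is a minimal sketch of that pipeline, not the authors' code; array shapes, variable names, and the random stand-in activations are illustrative assumptions.

```python
"""Hedged sketch of the described pipeline: correlation-based feature matching
followed by an SVCCA comparison of the matched SAE feature spaces."""
import numpy as np


def match_features_by_correlation(acts_a: np.ndarray, acts_b: np.ndarray) -> np.ndarray:
    """For each SAE feature of model A, find the model-B feature whose
    activations over a shared token set correlate most strongly.

    acts_a: (n_tokens, n_features_a) SAE feature activations for model A
    acts_b: (n_tokens, n_features_b) SAE feature activations for model B
    Returns an index array mapping each A feature to its best-matching B feature.
    """
    a = (acts_a - acts_a.mean(0)) / (acts_a.std(0) + 1e-8)
    b = (acts_b - acts_b.mean(0)) / (acts_b.std(0) + 1e-8)
    corr = a.T @ b / a.shape[0]      # (n_features_a, n_features_b) Pearson correlations
    return corr.argmax(axis=1)


def svcca(x: np.ndarray, y: np.ndarray, keep: int = 20) -> float:
    """SVCCA similarity: SVD-reduce each space, then average the canonical correlations."""
    def svd_reduce(m: np.ndarray) -> np.ndarray:
        m = m - m.mean(0)
        u, s, _ = np.linalg.svd(m, full_matrices=False)
        return u[:, :keep] * s[:keep]

    x_r, y_r = svd_reduce(x), svd_reduce(y)
    # Canonical correlations are the singular values of Qx^T Qy after QR orthonormalization.
    qx, _ = np.linalg.qr(x_r)
    qy, _ = np.linalg.qr(y_r)
    rho = np.linalg.svd(qx.T @ qy, compute_uv=False)
    return float(rho.mean())


# Usage with random stand-ins for real SAE activations (hypothetical sizes).
rng = np.random.default_rng(0)
acts_2b = rng.random((2000, 256))    # stand-in for Gemma-2-2B SAE activations
acts_9b = rng.random((2000, 320))    # stand-in for Gemma-2-9B SAE activations
matching = match_features_by_correlation(acts_2b, acts_9b)
score = svcca(acts_2b, acts_9b[:, matching])
print(f"SVCCA similarity of matched feature spaces: {score:.3f}")
```

In practice the activations would come from pretrained residual-stream SAEs run over a shared corpus, and the comparison would be repeated per layer to produce the layerwise similarity pattern described in the abstract.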
Archival Status: Non‑archival
Paper Length: Long Paper (up to 8 pages of content)
Submission Number: 313