Signal in the Noise: Polysemantic Interference Transfers and Predicts Cross-Model Influence

Published: 30 Sept 2025 · Last Modified: 20 Nov 2025 · Mech Interp Workshop (NeurIPS 2025) Spotlight · CC BY 4.0
Keywords: Sparse Autoencoders, AI Safety, Applications of interpretability
TL;DR: We evaluate two types of polysemantic vulnerabilities in LLMs using four intervention methods and provide evidence that these vulnerabilities are transferable due to shared polysemantic structures across models.
Abstract: Polysemanticity, where individual neurons encode multiple unrelated features, is a well-known characteristic of large neural networks and remains a central challenge in the interpretability of language models. Its implications for model safety are likewise poorly understood. Leveraging recent advances in sparse autoencoders, we investigate the polysemantic structure of two small models (Pythia-70M and GPT-2-Small) and evaluate their vulnerability to targeted, covert interventions at the prompt, feature, token, and neuron levels. Our analysis reveals a consistent polysemantic topology shared across both models. Strikingly, we demonstrate that this structure can be exploited to mount effective interventions on larger, black-box, instruction-tuned models (Llama-3.1-8B/70B-Instruct and Gemma-2-9B-Instruct). These findings not only suggest that the intervention strategies generalize, but also point to a stable and transferable polysemantic structure that persists across architectures and training regimes.
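
The abstract does not spell out how polysemantic neurons are identified; as a rough illustration of one way an SAE could surface them, the sketch below scores each MLP neuron by how many SAE decoder features place substantial weight on it. The decoder matrix, threshold, and scoring heuristic are hypothetical assumptions for illustration, not the paper's actual procedure.

```python
# Hypothetical sketch: scoring MLP neurons for polysemanticity via an SAE decoder.
# W_dec, its shape, and the threshold are illustrative assumptions; the paper's
# actual method is not specified in the abstract.
import torch

def polysemanticity_scores(W_dec: torch.Tensor, weight_threshold: float = 0.1) -> torch.Tensor:
    """Count, for each neuron (column), how many SAE features place
    substantial decoder weight on it. W_dec has shape (n_features, d_mlp)."""
    # Normalize each feature's decoder direction so weights are comparable.
    W = W_dec / W_dec.norm(dim=1, keepdim=True).clamp_min(1e-8)
    # A neuron is "touched" by a feature if the absolute weight exceeds the threshold.
    touched = W.abs() > weight_threshold
    # Neurons hit by many distinct features are candidates for polysemantic interference.
    return touched.sum(dim=0)

if __name__ == "__main__":
    torch.manual_seed(0)
    # Stand-in for a trained SAE decoder (e.g. ~24k features over GPT-2-Small's 3072 MLP neurons).
    W_dec = torch.randn(24576, 3072)
    scores = polysemanticity_scores(W_dec)
    top_neurons = scores.topk(10).indices
    print("Most polysemantic neurons (by feature count):", top_neurons.tolist())
```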
Submission Number: 238