Weakening Neurons: A Newly Discovered Read-Write Functionality in Transformers with Outsize Influence

19 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: mechanistic interpretability, neurons, parameter analysis, SwiGLU, LLM
TL;DR: We compute cosine similarities of weight vectors and find a small class of neurons with outsize influence on model behavior
Abstract: We introduce a new mechanistic interpretability method for gated neurons, based on the cosine similarities between their weight vectors, and use it to gain a number of novel insights into the inner workings of transformer models. First, our method allows us to discover a class of neurons – *weakening* neurons – with surprising behavior: even though they are few in number, they activate extremely often and have a large influence on model behavior. Second, we show that nine different LLMs exhibit similar patterns with respect to weakening neurons: weakening neurons appear mostly in late layers, whereas their counterparts, *(conditional) strengthening* neurons, are very frequent in early-middle layers. Third, weakening neurons have a strong effect on model output when gate values are negative – which is surprising, since negative gate values are not expected to encode functionality. Thus, for the first time, we observe a mechanism important for transformer functionality that involves negative gate values.
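The core quantity the abstract describes – a per-neuron cosine similarity between the weight vectors of a gated (SwiGLU-style) MLP – can be sketched as follows. This is a minimal illustration with randomly initialized weights; the matrix names (`W_gate`, `W_up`) and dimensions are assumptions for demonstration, not the paper's actual implementation, and the paper's criterion for classifying neurons as weakening vs. strengthening is not reproduced here.

```python
import numpy as np

# Hypothetical SwiGLU MLP weights: in a gated MLP, neuron i has a
# gate vector W_gate[i] and an up-projection vector W_up[i], both of
# dimension d_model. Random weights stand in for a real checkpoint.
rng = np.random.default_rng(0)
d_model, d_ff = 64, 256
W_gate = rng.standard_normal((d_ff, d_model))
W_up = rng.standard_normal((d_ff, d_model))

def rowwise_cosine(A, B):
    """Cosine similarity between corresponding rows of A and B."""
    num = np.sum(A * B, axis=1)
    denom = np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1)
    return num / denom

# One similarity score per neuron; the sign and magnitude of this
# score is the kind of signal such a weight-based method inspects.
cos = rowwise_cosine(W_gate, W_up)
print(cos.shape)
```

In a real analysis, `W_gate` and `W_up` would be loaded from a trained model's MLP layers (e.g., the `gate_proj` and `up_proj` weights of a LLaMA-family checkpoint), and the resulting per-neuron scores examined layer by layer.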
Primary Area: interpretability and explainable AI
Submission Number: 18717