Weakening Neurons: A Newly Discovered Read-Write Functionality in Transformers with Outsize Influence

ICLR 2026 Conference Submission 18717 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: mechanistic interpretability, neurons, parameter analysis, SwiGLU, LLM
TL;DR: We compute cosine similarities of weight vectors and find a small class of neurons with outsize influence on model behavior
Abstract: We introduce a new mechanistic interpretability method for gated neurons, based on an analysis of their read-write functionality, and use it to gain a number of novel insights into the inner workings of transformer models. First, our method allows us to discover a class of neurons -- *weakening* neurons -- with surprising behavior: even though they are few in number, they activate extremely often and have a large influence on model behavior. Second, we show that nine different LLMs exhibit similar patterns: weakening neurons appear mostly in late layers, whereas their counterparts, *(conditional) strengthening* neurons, are very frequent in early-to-middle layers. Third, weakening neurons have a strong effect on model output when gate values are negative -- surprising, since negative gate values are not expected to encode functionality. Thus, for the first time, we observe a mechanism important for transformer functionality that involves negative gate values.
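
As a rough illustration of the weight-space analysis the TL;DR describes, the sketch below computes cosine similarities between each SwiGLU neuron's read vectors (rows of the gate and up projections) and its write vector (a column of the down projection). This is a minimal sketch under stated assumptions, not the authors' method: the LLaMA-style module names (`gate_proj`, `up_proj`, `down_proj`, `model.model.layers`) are assumptions about the model implementation, and the paper's exact criterion for identifying weakening neurons is not specified here.

```python
# Minimal sketch (not the paper's code): cosine similarities between a
# gated (SwiGLU) neuron's read and write weight vectors.
# Assumes a LLaMA-style Hugging Face model whose MLP exposes
# `gate_proj`, `up_proj`, and `down_proj`; these names are assumptions.

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

def neuron_read_write_cosines(mlp):
    """Per-neuron cosine similarities between weight vectors.

    For neuron i: the gate read vector is row i of gate_proj.weight,
    the value read vector is row i of up_proj.weight, and the write
    vector is column i of down_proj.weight.
    """
    gate_read = mlp.gate_proj.weight        # (d_mlp, d_model)
    value_read = mlp.up_proj.weight         # (d_mlp, d_model)
    write = mlp.down_proj.weight.T          # (d_mlp, d_model)
    cos = torch.nn.functional.cosine_similarity
    return {
        "gate_vs_write": cos(gate_read, write, dim=-1),
        "value_vs_write": cos(value_read, write, dim=-1),
        "gate_vs_value": cos(gate_read, value_read, dim=-1),
    }

# Example: inspect a late layer, where the abstract says weakening
# neurons concentrate. Strongly negative value-vs-write cosines flag
# neurons that write against what they read -- one plausible proxy for
# "weakening" behavior, not the paper's exact definition.
sims = neuron_read_write_cosines(model.model.layers[-1].mlp)
print(sims["value_vs_write"].topk(5, largest=False).values)
```
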
Primary Area: interpretability and explainable AI
Submission Number: 18717