Weakening Neurons: A Newly Discovered Read-Write Functionality in Transformers with Outsize Influence
Keywords: mechanistic interpretability, neurons, parameter analysis, SwiGLU, LLM
TL;DR: We compute cosine similarities of weight vectors and find a small class of neurons with outsize influence on model behavior
Abstract: We introduce a new mechanistic interpretability method for gated neurons, based
on the cosine similarities between their weight vectors, and use it to gain a number
of novel insights into the inner workings of transformer models. First, our method
allows us to discover a class of neurons – *weakening* neurons – with surprising
behavior: even though they are few, they activate extremely often and have a
large influence on model behavior. Second, we show that nine different LLMs
exhibit the same layerwise pattern: weakening neurons appear mostly in late
layers, whereas their counterparts, *(conditional) strengthening* neurons, are
very frequent in early-to-middle layers. Third, weakening neurons have
a strong effect on model output when gate values are negative – which is surprising
since negative gate values are not expected to encode functionality. Thus, for the
first time, we observe a mechanism important for transformer functionality that
involves negative gate values.
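
The abstract's core measurement is straightforward to sketch. Below is a minimal, hypothetical illustration (not the authors' released code), assuming a Llama-style SwiGLU MLP loaded via Hugging Face Transformers, where `gate_proj`, `up_proj`, and `down_proj` are that implementation's standard attribute names: each neuron *i* reads the residual stream through row *i* of the gate and up matrices and writes back through column *i* of the down matrix, so cosine similarities between these per-neuron vectors can be computed directly from the weights. The model name and the `cos(w_up, w_down) < -0.5` threshold are illustrative assumptions, not the paper's definition of weakening neurons.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM

# Illustrative model choice: any Llama-style checkpoint with SwiGLU MLPs works.
model = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0", torch_dtype=torch.float32
)

with torch.no_grad():
    for layer_idx, layer in enumerate(model.model.layers):
        mlp = layer.mlp
        # Per-neuron weight vectors in the residual-stream basis:
        # row i of gate_proj/up_proj is what neuron i reads;
        # column i of down_proj (row i after transposing) is what it writes.
        w_gate = mlp.gate_proj.weight        # (d_mlp, d_model)
        w_up = mlp.up_proj.weight            # (d_mlp, d_model)
        w_down = mlp.down_proj.weight.T      # (d_mlp, d_model)

        # Cosine similarities between each neuron's read and write vectors.
        cos_gate_up = F.cosine_similarity(w_gate, w_up, dim=-1)
        cos_up_down = F.cosine_similarity(w_up, w_down, dim=-1)

        # Assumed criterion (not from the paper): a neuron whose write vector
        # opposes its value-read vector "weakens" the direction it reads.
        n_weakening = int((cos_up_down < -0.5).sum())
        print(
            f"layer {layer_idx:2d}: {n_weakening:4d} candidate weakening neurons, "
            f"mean cos(gate, up) = {cos_gate_up.mean().item():+.3f}, "
            f"mean cos(up, down) = {cos_up_down.mean().item():+.3f}"
        )
```

If the abstract's second finding holds, a scan like this across layers should show negative-cosine candidates concentrating in late layers, though the paper's exact neuron taxonomy may use different vector pairs or thresholds.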
Primary Area: interpretability and explainable AI
Submission Number: 18717