Weakening Neurons: A Newly Discovered Read-Write Functionality in Transformers with Outsize Influence
Keywords: mechanistic interpretability, neurons, parameter analysis, SwiGLU, LLM
TL;DR: We compute cosine similarities of weight vectors and find a small class of neurons with outsize influence on model behavior
Abstract: We introduce a new mechanistic interpretability method for gated neurons, based on an analysis of their read-write functionality, and use it to gain a number of novel insights into the inner workings of transformer models. First, our method allows us to discover a class of neurons -- *weakening* neurons -- with surprising behavior: even though they are few in number, they activate extremely often and have a large influence on model behavior. Second, we show that nine different LLMs exhibit similar patterns with respect to weakening neurons: weakening neurons appear mostly in late layers, whereas their counterparts, *(conditional) strengthening* neurons, are very frequent in early-middle layers. Third, weakening neurons have a strong effect on model output when gate values are negative -- which is surprising since negative gate values are not expected to encode functionality. Thus, for the first time, we observe a mechanism important for transformer functionality that involves negative gate values.
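For intuition, here is a minimal PyTorch sketch of the kind of weight-space analysis the TL;DR describes: comparing a gated (SwiGLU) neuron's read vectors (rows of the gate and up projections) with its write vector (the corresponding column of the down projection) via cosine similarity. The matrix names, the read-write pairing, and the threshold for flagging candidate weakening neurons are illustrative assumptions, not the paper's exact criterion.

```python
# Sketch: per-neuron cosine similarities between read and write vectors
# in a SwiGLU MLP. Assumed shapes: w_gate, w_up are (d_ff, d_model),
# w_down is (d_model, d_ff), as in many HF transformer implementations.
import torch
import torch.nn.functional as F

def neuron_read_write_cosines(w_gate, w_up, w_down):
    """Return two (d_ff,) tensors: cosine(gate-read, write) and
    cosine(up-read, write) for every neuron in one MLP block."""
    write = w_down.T  # (d_ff, d_model): row i is neuron i's write vector
    return (F.cosine_similarity(w_gate, write, dim=-1),
            F.cosine_similarity(w_up, write, dim=-1))

# Hypothetical usage on random weights; with a real model, one would take
# the weights of a single MLP block (e.g. model.layers[l].mlp).
d_model, d_ff = 64, 256
g = torch.randn(d_ff, d_model)
u = torch.randn(d_ff, d_model)
d = torch.randn(d_model, d_ff)
cos_gate, cos_up = neuron_read_write_cosines(g, u, d)

# One plausible (assumed) flag for candidate "weakening" neurons: a write
# vector strongly anti-aligned with the neuron's up/read vector.
candidates = (cos_up < -0.5).nonzero().squeeze(-1)
```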
Primary Area: interpretability and explainable AI
Submission Number: 18717