Abstract: Interpretability researchers have attempted to understand
MLP neurons of language models based on both the contexts in
which they activate and their output weight vectors. They
have paid little attention to a complementary aspect: the
interactions between input and output. For example, when
neurons detect a direction in the input, they might add much
the same direction to the residual stream ("enrichment neurons") or reduce its presence ("depletion neurons").
We address this aspect by examining the cosine similarity
between input and output weights of a neuron. We apply our
method to 12 models and find that enrichment neurons
dominate in early-middle layers whereas later layers tend
more towards depletion. To explain this finding, we argue
that enrichment neurons are largely responsible for
enriching concept representations, one of the first steps of
factual recall. Our input-output perspective
complements activation-dependent analyses
and approaches that treat input and output separately.
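The core measurement is simple enough to sketch in a few lines. Below is a minimal illustration (not the authors' released code), assuming GPT-2 loaded via Hugging Face transformers; the module names (h, mlp.c_fc, mlp.c_proj) are specific to GPT-2's Conv1D MLP layout and would differ for the other architectures covered in the paper.

import torch
import torch.nn.functional as F
from transformers import GPT2Model

model = GPT2Model.from_pretrained("gpt2")

with torch.no_grad():
    for layer, block in enumerate(model.h):
        # GPT-2 stores MLP weights as Conv1D: c_fc.weight has shape
        # [d_model, d_mlp] and c_proj.weight has shape [d_mlp, d_model].
        w_in = block.mlp.c_fc.weight.T   # [d_mlp, d_model]: per-neuron input weights
        w_out = block.mlp.c_proj.weight  # [d_mlp, d_model]: per-neuron output weights
        # Cosine similarity between each neuron's input and output vector,
        # averaged over the neurons in the layer.
        cos = F.cosine_similarity(w_in, w_out, dim=1)
        print(f"layer {layer:2d}: mean cos(w_in, w_out) = {cos.mean().item():+.3f}")

Under the abstract's finding, one would expect the per-layer mean to be positive in early-middle layers (enrichment dominating) and to drift toward negative values in later layers (depletion).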
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: parameter analysis, fundamental interpretability research, knowledge tracing, calibration/uncertainty
Contribution Types: Model analysis & interpretability, Theory
Languages Studied: English (but some of the models studied are multilingual, see appendix D.1)
Submission Number: 818