Keywords: Foundational work
Other Keywords: weights-based interpretability
TL;DR: We identify attention heads that construct summaries of the surrounding text, enabling the identification of context-sensitive neurons.
Abstract: We study attention heads in transformer language models whose attention
patterns are spread out and whose attention scores depend weakly on content. We
argue that the softmax denominators of these heads are stable when the underlying
token distribution is fixed. By sampling softmax denominators from a "calibration
text", we can combine the outputs of multiple such stable heads in the first
layer of GPT2-Small, approximating their combined output by a linear summary
of the surrounding text. This approximation enables a procedure in which, from
the weights and a single calibration text alone, we can uncover hundreds of
first-layer neurons that respond to high-level contextual properties of the
surrounding text, including neurons that did not activate on the calibration text.
Submission Number: 37
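
The core claim of the abstract can be illustrated with a small sketch (hypothetical code, not from the submission): for a head whose scores are spread out and depend weakly on content, the per-position softmax denominators sampled on one text transfer to another text drawn from the same distribution, so freezing the denominator turns the head output into an approximately linear summary of the value vectors. The tensor shapes, the score scale, and the Gaussian stand-ins for "texts from a fixed token distribution" are all assumptions made for this sketch.

import torch

torch.manual_seed(0)
d, T = 64, 128     # assumed head dimension and context length
scale = 0.05       # small score magnitude => spread-out, weakly content-dependent attention

def causal_scores(q, k):
    # Pre-softmax attention scores with a causal mask.
    s = (q @ k.T) / d**0.5
    mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
    return s.masked_fill(~mask, float("-inf"))

# "Calibration text": queries and keys drawn from the stand-in distribution.
q_cal, k_cal = torch.randn(T, d) * scale, torch.randn(T, d) * scale
z_cal = torch.exp(causal_scores(q_cal, k_cal)).sum(-1)  # per-position softmax denominators

# A different "target text" drawn from the same distribution.
q_tgt, k_tgt = torch.randn(T, d) * scale, torch.randn(T, d) * scale
v_tgt = torch.randn(T, d)

s_tgt = causal_scores(q_tgt, k_tgt)
exact = torch.softmax(s_tgt, dim=-1) @ v_tgt

# Freeze the denominator at its calibrated value; the head output is then an
# (approximately) linear function of the context's value vectors.
approx = (torch.exp(s_tgt) @ v_tgt) / z_cal[:, None]

rel_err = ((exact - approx).norm() / exact.norm()).item()
print(f"relative error with calibrated denominators: {rel_err:.3f}")

Because the scores are near zero, each denominator concentrates around the number of attended positions, so the value sampled on the calibration text carries over to the target text with small relative error; this stability is what allows the outputs of several such heads to be combined into a single linear summary of the context.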