Abstract: Although feed-forward neurons in pre-trained language models (PLMs) are known to store knowledge, and their influence on model outputs has been studied, existing work focuses on identifying a limited set of neurons and analyzing their relative importance.
However, the global quantitative role of activation values in shaping outputs remains unclear, hindering further advancements in applications like knowledge editing.
Our study first investigates the numerical relationship between neuron activations and model outputs, discovering a global linear relationship between them through neuron interventions on a knowledge-probing dataset.
We refer to the slope of this linear relationship as the neuron empirical gradient (NEG), and introduce NeurGrad, an accurate and efficient method for computing NEG.
NeurGrad enables quantitative analysis of all neurons in PLMs, advancing our understanding of neurons' controllability.
Furthermore, we explore NEG's ability to represent language skills across diverse prompts via skill neuron probing. Experiments on MCEval8k, a multiple-choice knowledge benchmark spanning various genres, validate NEG's representational ability.
The data and code are released.
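To make the core idea concrete: the neuron empirical gradient is the slope obtained by intervening on a single neuron's activation and regressing the output change against the intervention. The sketch below illustrates this on a hypothetical linear readout with NumPy; all names (`W_out`, `logit`, the neuron index) are illustrative assumptions, and this is not the paper's NeurGrad method itself.

```python
import numpy as np

# Toy illustration of a "neuron empirical gradient" (NEG): intervene on one
# neuron's activation, record the output logit, and fit a line.
# Everything here is a hypothetical stand-in, not the paper's actual model.

rng = np.random.default_rng(0)
W_out = rng.normal(size=8)   # hypothetical readout weights over 8 "neurons"
a = rng.normal(size=8)       # baseline activations for one prompt

def logit(acts):
    """Toy scalar output: a linear readout of the activations."""
    return float(W_out @ acts)

j = 3                                  # neuron to intervene on
deltas = np.linspace(-2.0, 2.0, 9)     # activation offsets to sweep
logits = []
for d in deltas:
    acts = a.copy()
    acts[j] = a[j] + d                 # intervention on neuron j only
    logits.append(logit(acts))

# Slope of the fitted line = empirical gradient of the output w.r.t. neuron j.
slope, intercept = np.polyfit(deltas, np.array(logits), 1)
# In this linear toy model the slope recovers the analytic gradient W_out[j].
```

In a real PLM the readout is nonlinear, so the fitted slope is only an empirical, local estimate; the abstract's claim is that such slopes behave globally linearly across interventions.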
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: probing, feature attribution
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 7542