Interpreting Arithmetic Mechanism in Large Language Models through Comparative Neuron Analysis

ACL ARR 2024 June Submission809 Authors

13 Jun 2024 (modified: 02 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: We find that arithmetic ability resides within a limited number of attention heads, with each head specializing in distinct operations. To investigate the cause of this phenomenon, we introduce the Comparative Neuron Analysis (CNA) method, which identifies an internal logic chain consisting of four distinct stages from input to prediction: feature enhancing with shallow FFN neurons, feature transferring by shallow attention layers, feature predicting by arithmetic heads, and prediction enhancing among deep FFN neurons. Moreover, we identify human-interpretable FFN neurons within both the feature-enhancing and feature-predicting stages. These findings lead us to investigate the mechanism of LoRA, revealing that it enhances prediction probabilities by amplifying the coefficient scores of FFN neurons related to predictions. Finally, we apply our method to model pruning for arithmetic tasks and to model editing for reducing gender bias. Our code and data will be released on GitHub.
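For readers unfamiliar with the LoRA update that the abstract refers to, the following is a minimal NumPy sketch of the standard low-rank adaptation of a linear layer (W' = W + (alpha/r)·BA). It is an illustration of LoRA itself, not of the paper's CNA analysis; the dimensions, names, and scaling value here are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, r = 8, 8, 2  # hypothetical layer and rank sizes
W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable low-rank factor
B = np.zeros((d_out, r))                # zero-initialized, so W' == W before training
alpha = 4.0                             # LoRA scaling hyperparameter

def lora_forward(x):
    # Standard LoRA forward pass: W x + (alpha / r) * B (A x).
    # Applied to an FFN weight, the low-rank term shifts neuron
    # activations, which is how it can rescale per-neuron coefficients.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
# With B still zero, the adapted layer matches the frozen layer exactly.
assert np.allclose(lora_forward(x), W @ x)
```

Because B starts at zero, a freshly initialized adapter leaves the model's predictions unchanged; only training moves the output away from the frozen baseline.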
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: arithmetic mechanism, feature attribution, knowledge tracing
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 809