Abstract: In recent years, large language models (LLMs) have achieved significant advances in arithmetic reasoning. However, the internal mechanism by which LLMs solve arithmetic problems remains unclear. In this paper, we propose explaining arithmetic reasoning in LLMs through game-theoretic interactions. Specifically, we disentangle the LLM's output score into numerous interactions among the input words.
We quantify the different types of interactions encoded by LLMs during forward propagation to explore how LLMs internally solve arithmetic problems. We find that (1) LLMs solve simple one-operator arithmetic problems by encoding operand-operator interaction patterns and high-order interaction patterns from input samples; moreover, LLMs with poor arithmetic capability focus more on context-free interactions. (2) LLMs solve relatively complex two-operator arithmetic problems by encoding operator interaction patterns from input samples. (3) An LLM gradually forgets its capability to solve simple one-operator arithmetic problems as it learns to solve relatively complex two-operator problems.
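For reference, a minimal sketch of one common game-theoretic formulation, the Harsanyi dividend (the paper may define or refine its interaction metric differently): given the model output score v(T) on a masked input containing only the word subset T, the interaction of a word subset S and the resulting decomposition of the output on the full input N can be written as

I(S) = \sum_{T \subseteq S} (-1)^{|S|-|T|} \, v(T), \qquad v(N) = \sum_{S \subseteq N} I(S).

Under this formulation, the order of an interaction is the subset size |S|, so "high-order interaction patterns" correspond to effects that emerge only when many input words are considered jointly.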
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: interpretability; math QA; feature attribution
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 238