Interpreting the Inner Mechanisms of Large Language Models in Mathematical Addition

21 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: visualization or interpretation of learned representations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Model Interpretability, Large Language Models, Transformers, Mathematical Computation
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We study how LLMs implement the mathematical addition task and provide a human-understandable interpretation supported by thorough evaluation.
Abstract: Large language models (LLMs) have achieved stunning performance on various language tasks, yet they remain largely a black box. Understanding the internal mechanisms of LLMs could contribute to the development of more transparent and interpretable models. To this end, we make a first attempt to reveal a specific mechanism of how LLMs implement the reasoning task of mathematical addition, i.e., scenarios involving the addition of two integers. Through comprehensive experiments, we find that LLMs frequently engage only a small fraction of attention heads (0.5% of all heads) when performing the addition task. Meanwhile, knocking out these frequently involved heads significantly degrades the LLMs' performance on the same task. Surprisingly, the key heads identified for a specific model generalize well across multiple datasets related to mathematical addition. Moreover, we observe an intuitive phenomenon: knocking out these key heads also degrades the LLMs' performance on mathematical subtraction, which mirrors human behavior. Our work serves as a preliminary exploration into the mathematical prowess of LLMs, laying a foundation for revealing more intricate capabilities.
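The abstract's central intervention is "knocking out" (zero-ablating) individual attention heads and measuring the drop in addition accuracy. The sketch below is an illustrative reconstruction of that idea, not the authors' implementation: it uses the TransformerLens library, a small stand-in model ("gpt2"), a hypothetical layer/head pair, and an assumed "a+b=" prompt format purely for demonstration.

```python
# Illustrative sketch (assumptions: TransformerLens, gpt2 as a stand-in model,
# hypothetical layer/head indices, and an "a+b=" prompt format).
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

def make_zero_ablation_hook(head_index):
    # hook_z has shape [batch, pos, n_heads, d_head]; zeroing one head's slice
    # removes that head's contribution to the residual stream ("knock-out").
    def hook(z, hook):
        z[:, :, head_index, :] = 0.0
        return z
    return hook

def top1_correct(prompt, answer, fwd_hooks=()):
    # Crude correctness proxy: does the model's top next-token prediction
    # match the first token of the expected answer?
    tokens = model.to_tokens(prompt)
    logits = model.run_with_hooks(tokens, fwd_hooks=list(fwd_hooks))
    pred = logits[0, -1].argmax().item()
    target = model.to_tokens(answer, prepend_bos=False)[0, 0].item()
    return pred == target

# Toy two-integer addition prompts.
prompts = [(f"{a}+{b}=", str(a + b)) for a, b in [(12, 7), (23, 45), (8, 9)]]

layer, head = 9, 1  # hypothetical "key head" indices, for illustration only
hooks = [(f"blocks.{layer}.attn.hook_z", make_zero_ablation_hook(head))]

clean = sum(top1_correct(p, a) for p, a in prompts)
ablated = sum(top1_correct(p, a, hooks) for p, a in prompts)
print(f"clean: {clean}/{len(prompts)}, head knocked out: {ablated}/{len(prompts)}")
```

In the paper's setting, a head would be flagged as "key" when ablating it causes a large accuracy drop on the addition task while ablating most other heads does not.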
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: pdf
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3226