Keywords: Mechanistic Interpretability, Large Language Models, Addition, Arithmetic, Algorithmic Reasoning, Circuits
TL;DR: We show that LLMs learn representations of integers in addition tasks that generalize across prompt templates, number formats, and languages, and we reverse-engineer the two-argument addition circuit for multi-token integers in Llama 3.1 8B
Abstract: Large Language Models (LLMs) are often treated as black boxes, yet many of their behaviours suggest the presence of internal, algorithm-like structures. We present the addition circuit as a concrete, mechanistic example of such a structure: a sparse set of attention heads that performs integer addition. Focusing on two popular open-source models (Llama 3.1 8B and Llama 3.1 70B), we make the following contributions. (i) We extend prior work on two-argument addition to the multi-argument setting, showing that both models employ fixed subsets of attention heads specialized in encoding summands at specific positions in addition prompts. (ii) We introduce state vectors that efficiently capture how models represent summands in their activation spaces. We find that each model learns a common representation of integers that generalizes across prompt formats and across six languages, whether numbers are expressed as Arabic digits or as word numerals.
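A minimal sketch of how such a cross-format probe could be set up, assuming access to the model via the Hugging Face `transformers` API. The layer index, the prompt templates, the `summand_state` helper, and the prefix-tokenization position heuristic are all illustrative assumptions, not the paper's exact procedure:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative sketch: read the residual-stream hidden state at the last
# token of a summand and treat it as that summand's "state vector".
MODEL = "meta-llama/Llama-3.1-8B"  # assumes access to the gated checkpoint
LAYER = 16  # assumed mid-depth layer; the paper's choice may differ

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
model.eval()

@torch.no_grad()
def summand_state(prompt: str, summand: str) -> torch.Tensor:
    """Hidden state at the final token of `summand` within `prompt`."""
    ids = tok(prompt, return_tensors="pt")
    out = model(**ids, output_hidden_states=True)
    # Approximate the summand's last token position by tokenizing the
    # prefix of the prompt that ends with the summand (heuristic; BPE
    # merges across the boundary can shift this by a token).
    prefix = prompt[: prompt.index(summand) + len(summand)]
    pos = tok(prefix, return_tensors="pt")["input_ids"].shape[1] - 1
    return out.hidden_states[LAYER][0, pos].float()

# Same integer (342) in two surface forms; a high cosine similarity
# would indicate a shared, format-invariant representation.
v_digits = summand_state("342 + 517 =", "342")
v_words = summand_state(
    "three hundred forty-two plus five hundred seventeen equals", "forty-two"
)
print(F.cosine_similarity(v_digits, v_words, dim=0).item())
```

The same comparison could be repeated across prompt templates and languages to test how far the representation generalizes.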
Primary Area: interpretability and explainable AI
Submission Number: 25426