Abstract: Autoregressive large language models (LLMs) exhibit impressive
performance across various tasks but struggle with simple arithmetic,
such as the addition of two or more operands. We show that this struggle
arises from LLMs’ use of a simple one-digit lookahead heuristic, which
works fairly well (though not perfectly) for two-operand addition but fails
in multi-operand cases, where the carry-over logic is more complex. Our
probing experiments and digit-wise accuracy evaluation show that LLMs
fail precisely where a one-digit lookahead is insufficient to account
for cascading carries. We analyze the impact of tokenization strategies
on arithmetic performance and show that all investigated models,
regardless of tokenization, are inherently limited in the addition of
multiple operands due to their reliance on a one-digit lookahead
heuristic. Our findings reveal fundamental limitations that prevent LLMs
from generalizing to more complex numerical reasoning.
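To make the described failure mode concrete, the sketch below (our own illustration, not code from the paper; function names such as exact_sum_digits are hypothetical, and the heuristic is formalized here, as one plausible reading of "one-digit lookahead", as guessing the carry into each digit position from the single column to its right) contrasts exact column-wise addition with such a lookahead. For two operands the guess is only wrong when a carry cascades across columns (e.g., 55 + 45); with more operands, cascades and carries larger than one become common, matching the multi-operand failures reported above.

def exact_sum_digits(operands, width):
    """Digit-wise sum of the operands, least significant digit first."""
    digits, carry = [], 0
    for i in range(width):
        col = sum((n // 10 ** i) % 10 for n in operands) + carry
        digits.append(col % 10)
        carry = col // 10
    return digits

def lookahead_sum_digits(operands, width):
    """One-digit lookahead (illustrative): guess the carry into position i
    from column i-1 alone, ignoring carries cascading into column i-1."""
    digits = []
    for i in range(width):
        col = sum((n // 10 ** i) % 10 for n in operands)
        carry_guess = 0
        if i > 0:
            prev_col = sum((n // 10 ** (i - 1)) % 10 for n in operands)
            carry_guess = prev_col // 10
        digits.append((col + carry_guess) % 10)
    return digits

if __name__ == "__main__":
    # 55 + 45 needs a cascading carry, so the heuristic already errs there;
    # the three-operand case 55 + 45 + 5 errs for the same reason.
    for ops in [(17, 25), (55, 45), (55, 45, 5)]:
        w = len(str(sum(ops)))
        print(ops, exact_sum_digits(ops, w), lookahead_sum_digits(ops, w))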
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Interpretability, LLMs, Limitations of LLMs, Probing
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 3149