Attend or Perish: Benchmarking Attention on Algorithmic Reasoning

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: transformers, generalization, length extrapolation, algorithmic reasoning, evaluation, interpretability
TL;DR: We attribute models' inability to extrapolate to a failure of the attention mechanism to learn a generalized pattern from fast-rotating RoPE embeddings.
Abstract: While Transformer models can learn algorithmic tasks and generalize reliably to unseen, in-distribution data, they often fail catastrophically when required to extrapolate to sequence lengths beyond their training regime. This paper investigates the root cause of this critical failure in length generalization. Using AttentionSpan—a benchmark of algorithmic tasks such as addition and multiplication, specifically designed to enable interpretability and facilitate inspection of internal model computations—we analyze the model’s behavior on length extrapolation. Our findings indicate that this failure does not reflect a fundamental limitation in the model’s ability to generalize or to induce general rules for the task. Instead, we attribute the problem to inconsistent attention patterns—information retrieval strategies learned by individual attention heads—which fail to remain stable as sequence length increases. This inconsistency disrupts the execution of the algorithm at novel lengths. We show that fine-tuning just a single column of the Key and Query projection matrices in all attention heads on sequences longer than those seen during initial training is sufficient for the model to perform well on these same longer sequences. While this does not extend extrapolation beyond the fine-tuned lengths, it demonstrates that robust length generalization can be achieved with a minimal adjustment to attention weights, suggesting that such failures could be addressed early in training. We make our benchmark and code publicly available.
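The abstract's fine-tuning intervention — updating only a single column of the Key and Query projection matrices — can be illustrated with gradient masking. The sketch below is a minimal, hypothetical setup (the model, column index, and loss are illustrative, not the paper's actual configuration); it assumes PyTorch's `nn.MultiheadAttention`, whose `in_proj_weight` stacks W_Q, W_K, and W_V along the row dimension.

```python
import torch
import torch.nn as nn

# Illustrative sketch only: restrict fine-tuning to one column of W_Q and W_K
# by freezing all other parameters and masking gradients on the packed
# in_proj_weight. Column index and dimensions are arbitrary choices.
d_model, col = 16, 3
attn = nn.MultiheadAttention(d_model, num_heads=2, batch_first=True)

# Freeze everything, then re-enable gradients only on the packed projection.
for p in attn.parameters():
    p.requires_grad = False
attn.in_proj_weight.requires_grad = True

# in_proj_weight has shape (3*d_model, d_model): rows [0, 2*d_model) hold
# W_Q and W_K. Zero out gradients everywhere except their column `col`.
mask = torch.zeros_like(attn.in_proj_weight)
mask[: 2 * d_model, col] = 1.0
attn.in_proj_weight.register_hook(lambda g: g * mask)

# One dummy forward/backward pass to show the masking in action.
x = torch.randn(1, 5, d_model)
out, _ = attn(x, x, x)
out.sum().backward()

grad = attn.in_proj_weight.grad
print(grad[: 2 * d_model, col].abs().sum() > 0)   # chosen Q/K column updates
print(grad[:, :col].abs().sum() == 0)             # other columns stay frozen
```

An optimizer step after this backward pass would then modify only the selected column, matching the spirit of the minimal-adjustment experiment described in the abstract.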
Supplementary Material: zip
Primary Area: interpretability and explainable AI
Submission Number: 11209