Attention Bias as an Inductive Bias: How to Teach Transformers Simple Arithmetic

Published: 10 Oct 2024, Last Modified: 31 Oct 2024, MATH-AI 24, CC BY 4.0
Keywords: Transformer, Attention, Inductive Bias, Arithmetic, Length Generalization
TL;DR: We investigate the ability of Transformer models to do arithmetic from the perspective of inductive learning.
Abstract: In this paper, we study the Transformer's capability to learn arithmetic from an inductive learning perspective and draw attention to the importance of inductive biases. We first introduce a definition of length generalization, requiring the model to maintain near-perfect accuracy on samples at least 10 times longer than those seen in training, as an indicator of successful learning. Through experiments and attention analysis, we show that the vanilla Transformer's failure to learn arithmetic stems from inadequate inductive biases. We then present Attention Bias Scaffolding (ABS), which uses attention masking to enforce the necessary inductive bias, making it the first Transformer-based architecture to achieve complete length generalization on several arithmetic tasks such as addition and parity. Additionally, we introduce Attention Bias Calibration (ABC), a calibration stage that allows the model to learn the proper attention biases itself and obtain complete length generalization automatically on tasks whose attention patterns it can interpolate. Finally, we show that ABC bears remarkable similarities to relative positional encoding (RPE) and LoRA, which may indicate its potential for application to more complex tasks.
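The abstract does not specify the exact masking scheme used by ABS or the biases learned by ABC, so the following is only a minimal sketch of the general mechanism they both rely on: adding a fixed (or learned) bias matrix to the attention scores before the softmax. The local windowed mask below is a hypothetical illustration, not the paper's actual scaffolding.

```python
import torch

def biased_attention(q, k, v, attn_bias):
    """Scaled dot-product attention with an additive attention bias.

    attn_bias has shape (seq_len, seq_len); entries of -inf mask a
    position out entirely, while finite values merely re-weight it.
    """
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d**0.5  # (..., L, L) attention scores
    scores = scores + attn_bias                # inject the inductive bias
    weights = torch.softmax(scores, dim=-1)
    return weights @ v

# Hypothetical bias: each position may attend only to itself and its
# immediate neighbours, a crude stand-in for an alignment-style mask.
L, d = 8, 16
bias = torch.full((L, L), float("-inf"))
for i in range(L):
    for j in range(max(0, i - 1), min(L, i + 2)):
        bias[i, j] = 0.0

q, k, v = (torch.randn(1, L, d) for _ in range(3))
out = biased_attention(q, k, v, bias)
print(out.shape)  # torch.Size([1, 8, 16])
```

In this framing, ABS would correspond to hand-specifying `attn_bias` for a task, while ABC would correspond to learning it in a separate calibration stage; the bias matrix is the only place the inductive prior enters the computation.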
Concurrent Submissions: N/A
Submission Number: 17