On the "Induction Bias" in Sequence Models

Published: 02 Mar 2026, Last Modified: 02 Mar 2026 · Sci4DL 2026 · CC BY 4.0
Keywords: State Tracking, Transformers, Recurrent Neural Networks, Sample Efficiency, Data Efficiency, Chain-of-Thought, Inductive Bias, Length Generalization, Modular Arithmetic, Weight Sharing
TL;DR: We show transformers exhibit poor in-distribution data efficiency on state-tracking tasks compared to RNNs, primarily because they learn isolated, length-specific solutions rather than sharing mechanisms across different sequence lengths.
Abstract: Despite the remarkable practical success of transformer-based language models, recent work has raised concerns about their ability to perform state tracking, particularly in out-of-distribution (OOD) settings such as length extrapolation. In this work, we shift attention to the in-distribution implications of these limitations. We empirically compare the data efficiency of transformers and recurrent neural networks (RNNs) and find that the amount of training data required by transformers grows much more rapidly with state-space size and sequence length than it does for RNNs. Furthermore, we analyze the extent to which learned state-tracking mechanisms are shared across different sequence lengths. We show that transformers exhibit negligible or even detrimental weight sharing across lengths, indicating that they learn length-specific solutions in isolation. In contrast, recurrent models exhibit effective amortized learning by sharing weights across lengths. Together, these results demonstrate that state tracking remains a fundamental challenge for transformers, even when training and evaluation distributions match.
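The keywords mention modular arithmetic as one of the state-tracking tasks studied. As a minimal illustration of what such a task looks like (this sketch is not taken from the paper; the function name and setup are hypothetical), consider running-sum prediction over a modular group, where the label at each position depends on the entire prefix:

```python
import random

def make_state_tracking_example(seq_len, modulus, rng=random):
    """Generate one modular-addition state-tracking example.

    The input is a sequence of tokens drawn from {0, ..., modulus-1}; the
    target at each position is the running sum of the prefix, mod `modulus`.
    A recurrent model can solve this by carrying a single hidden state
    forward, while a transformer must aggregate the whole prefix anew at
    every position -- the asymmetry the abstract's data-efficiency
    comparison is about.
    """
    tokens = [rng.randrange(modulus) for _ in range(seq_len)]
    state, targets = 0, []
    for t in tokens:
        state = (state + t) % modulus
        targets.append(state)
    return tokens, targets

# Example: one training sequence of length 8 over Z_5
tokens, targets = make_state_tracking_example(8, 5)
```

Varying `seq_len` and `modulus` here corresponds to varying the sequence length and state-space size along which the abstract reports transformer data requirements growing rapidly.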
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Style Files: I have used the style files.
Challenge: This submission is an entry to the science of DL improvement challenge.
Submission Number: 44