What Algorithms can Transformers Learn? A Study in Length Generalization
Keywords: length generalization, systematic generalization, understanding, transformer, scratchpad, LLM, algorithmic reasoning
TL;DR: We show that length generalization of Transformer models trained from scratch strongly correlates with the simplicity of the true RASP-L program for the task.
Abstract: Large language models exhibit surprising emergent generalization properties, yet also struggle on many simple reasoning tasks such as arithmetic and parity. In this work, we focus on length generalization, and we propose a unifying framework to understand when and how Transformers can be expected to length generalize on a given task. First, we show that there exist algorithmic tasks for which standard decoder-only Transformers trained from scratch naturally exhibit strong length generalization. For these tasks, we leverage the RASP programming language (Weiss et al., 2021) to show that the correct algorithmic solution can be represented by a simple Transformer. We thus propose and give evidence for the RASP-Generalization Conjecture: Transformers tend to learn a length-generalizing solution if there exists a short RASP-L program that works for all input lengths. We then leverage our insights to give new scratchpad formats which yield strong length generalization on traditionally hard tasks (such as parity and addition). Overall, our work provides a novel perspective on the mechanisms of length generalization and the algorithmic capabilities of Transformers.
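As an illustration of the setting described in the abstract, the following is a minimal sketch of a length-generalization split for parity, in which training sequences are short and test sequences are strictly longer. The running-parity scratchpad, the text format, and the helper names (`parity_with_scratchpad`, `make_example`, `length_split`) are assumptions made for this example and are not necessarily the scratchpad formats proposed in the paper.

```python
import random


def parity_with_scratchpad(bits):
    """Return (scratchpad, answer), where scratchpad[i] is the parity of bits[:i+1]."""
    running = 0
    scratchpad = []
    for b in bits:
        running ^= b  # XOR accumulates parity one bit at a time
        scratchpad.append(running)
    answer = scratchpad[-1] if scratchpad else 0
    return scratchpad, answer


def make_example(length, rng):
    """Build one training string. The '<bits> > <scratchpad> = <answer>' layout is hypothetical."""
    bits = [rng.randint(0, 1) for _ in range(length)]
    scratchpad, answer = parity_with_scratchpad(bits)
    return "{} > {} = {}".format(
        " ".join(map(str, bits)),
        " ".join(map(str, scratchpad)),
        answer,
    )


def length_split(train_max_len, test_len, n, seed=0):
    """Train on lengths 1..train_max_len, test on a single longer length."""
    rng = random.Random(seed)
    train = [make_example(rng.randint(1, train_max_len), rng) for _ in range(n)]
    test = [make_example(test_len, rng) for _ in range(n)]
    return train, test
```

A model would be said to length generalize in this sketch if it is trained only on `train` (for example with `train_max_len=10`) and still produces the correct final answer on `test` examples of a greater length such as `test_len=20`.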
Submission Number: 47