Abstract: Length generalization refers to the ability to extrapolate from short training sequences to long test sequences and is a challenge for current large language models. While prior work has proposed architectural or data-format changes to achieve length generalization, these proposals typically apply only to a limited set of tasks. Building on prior scratchpad and Chain-of-Thought (CoT) techniques, we propose *Turing Programs*, a novel CoT strategy that decomposes an algorithmic task into steps mimicking the computation of a Turing machine. This framework is both universal, as it can accommodate any algorithmic task, and simple, requiring only copying text from the context with small modifications. We show that by using Turing Programs, we obtain robust length generalization on a range of algorithmic tasks: addition, multiplication, and in-context SGD. We then demonstrate that transformers achieve length generalization on random Turing Programs, suggesting that length generalization is possible for any algorithmic task. Finally, we prove that transformers can implement Turing Programs by constructing a simple RASP (Weiss et al.) program that simulates an arbitrary Turing machine.
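To make the "copy with small modifications" idea concrete, here is a minimal Python sketch, our own illustration rather than the paper's RASP construction: one Turing machine step rewrites the whole tape with a single local edit, so each chain-of-thought step amounts to copying the previous step's text and changing one cell. The transition table `delta` and the bit-flipping machine below are hypothetical examples.

```python
# A minimal sketch (an assumption about the general scheme, not the
# paper's exact encoding): one Turing machine step is "copy tape, edit
# one cell, move the head", so the scratchpad is a sequence of tapes.

def tm_step(tape: str, head: int, state: str, delta: dict):
    """Apply one transition. `delta` maps (state, symbol) to
    (new_state, written_symbol, move), where move is -1, 0, or +1."""
    new_state, write, move = delta[(state, tape[head])]
    tape = tape[:head] + write + tape[head + 1:]  # copy tape, edit one cell
    return tape, head + move, new_state

# Hypothetical machine that flips bits until it reads the blank symbol '_'.
delta = {("flip", "0"): ("flip", "1", +1),
         ("flip", "1"): ("flip", "0", +1),
         ("flip", "_"): ("halt", "_", 0)}

tape, head, state = "0110_", 0, "flip"
while state != "halt":
    tape, head, state = tm_step(tape, head, state, delta)
    print(tape, f"head={head}", state)  # one scratchpad line per step
```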
Lay Summary: If language models are trained only on 10-digit addition problems, can they solve 20-digit additions? This example illustrates length generalization—a model's ability to extrapolate to longer tasks than it encountered during training. While previous research has demonstrated length generalization for specific problems, we sought to develop a technique that would work across many different tasks.
Our solution is Turing Programs, a Chain-of-Thought technique that breaks complex problems into simple steps modeled after how Turing machines perform computations. At each step, the model copies the data from the previous step while performing a single “unit” of computation. Using this approach, we achieved length generalization across various tasks, including addition and multiplication.
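As a hedged illustration of what such a scratchpad might look like for addition (the step format here is our assumption, not necessarily the paper's exact encoding), each line below copies the still-unprocessed digits forward and performs one unit of work, adding a single digit pair:

```python
# A minimal sketch of a Turing-Program-style scratchpad for multi-digit
# addition: every step restates the remaining input (a copy with a small
# modification) together with the running carry and partial result.

def addition_turing_program(a: str, b: str) -> list[str]:
    """Emit one scratchpad line per step; digits are processed right to left."""
    a, b = a.zfill(len(b)), b.zfill(len(a))  # pad to equal length
    steps, result, carry = [], "", 0
    for i in range(len(a) - 1, -1, -1):
        s = int(a[i]) + int(b[i]) + carry
        carry, digit = divmod(s, 10)
        result = str(digit) + result
        steps.append(f"{a[:i]}+{b[:i]}, carry={carry}, partial={result}")
    if carry:
        steps.append(f"carry resolved, result=1{result}")
    return steps

for line in addition_turing_program("478", "964"):
    print(line)
```

Run on `478 + 964`, this prints one line per digit pair and a final carry-resolution line ending in `1442`; because every step is just the previous step with a local edit, the same format extends to inputs of any length.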
Primary Area: Theory->Deep Learning
Keywords: length generalization, deep learning
Submission Number: 11539