TL;DR: We propose a theoretical explanation for when a learner can learn a hypothesis that generalizes across input lengths. We prove upper bounds on the minimum training input length needed to length-generalize, for various function classes.
Abstract: Length generalization is the ability of a learning algorithm to learn a hypothesis that generalizes to inputs longer than those in the training set. In this paper, we provide provable guarantees of length generalization for various function classes in an idealized setting. First, we formalize the framework of non-asymptotic length generalization, which requires a computable upper bound on the minimum input length that guarantees length generalization, as a function of the complexity of the ground-truth function under some given complexity measure. We refer to this minimum input length required to length generalize as the length complexity. We show that the Minimum-Complexity Interpolator learning algorithm achieves optimal length complexity. We further show that a function class admits non-asymptotic length generalization if and only if its language equivalence problem is decidable, which implies that there is no computable upper bound on the length complexity of Context-Free Grammars. On the positive side, we show that the length complexity of Deterministic Finite Automata is $2n - 2$, where $n$ is the number of states of the ground-truth automaton. Our main results are upper bounds on the length complexity of a subset of the transformer-related function class C-RASP (Yang & Chiang, 2024). We show that the length complexity of 1-layer C-RASP functions is $O(T^2)$ when the ground-truth function has precision $T$, and that the length complexity of 2-layer C-RASP functions is $O(T^{O(K)})$ when the ground-truth function has precision $T$ and $K$ heads.
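To make the setup concrete, below is a minimal, hypothetical Python sketch of the Minimum-Complexity Interpolator over a finite hypothesis class; the function names, the binary alphabet, and the toy "at least $k$ ones" hypotheses are illustrative assumptions, not the paper's construction. The DFA result instantiates the same pattern: for a ground-truth automaton with $n = 3$ states, the $2n - 2$ bound says that fitting all strings of length at most $4$ suffices to length-generalize.

```python
from itertools import product

def minimum_complexity_interpolator(hypotheses, complexity, target, max_len,
                                     alphabet=("0", "1")):
    """Hypothetical sketch of the Minimum-Complexity Interpolator (MCI).

    The learner observes the ground-truth labels of *all* strings of length at
    most `max_len` and returns the lowest-complexity hypothesis consistent with
    those labels. The paper's length-complexity bounds state how large
    `max_len` must be (as a function of the ground truth's complexity) for this
    choice to generalize to arbitrarily long inputs.
    """
    # Training inputs: every string over `alphabet` of length 0, 1, ..., max_len.
    train_inputs = ["".join(w)
                    for n in range(max_len + 1)
                    for w in product(alphabet, repeat=n)]
    # Hypotheses that interpolate (exactly fit) the labeled training data.
    consistent = [h for h in hypotheses
                  if all(h(x) == target(x) for x in train_inputs)]
    # Return the consistent hypothesis of minimum complexity.
    return min(consistent, key=complexity)

# Toy usage (illustrative): hypotheses "input has at least k ones", complexity k.
hyps = [lambda x, k=k: x.count("1") >= k for k in range(4)]
learned = minimum_complexity_interpolator(
    hyps, complexity=hyps.index, target=hyps[2], max_len=2)
assert learned is hyps[2]
```

In this toy instance, seeing all binary strings of length at most 2 already pins down the ground truth; the paper's results give such thresholds with proof for DFAs and for 1- and 2-layer C-RASP functions.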
Lay Summary: We study length generalization, an empirical phenomenon observed in the training of Large Language Models (LLMs). Length generalization occurs when a model fitted to data up to a certain length also performs well on longer data. For example, a model trained to multiply 20-digit numbers by 20-digit numbers may also learn to multiply 100-digit numbers by 100-digit numbers correctly. Length generalization is a non-trivial phenomenon, and it is useful for training models efficiently.
While most studies of length generalization so far have been empirical, our paper provides a theoretical perspective. Our aim is to contribute to a better understanding of when length generalization does or does not occur.
Our theoretical paper derives, with proof, mathematical expressions for the smallest length of training data that guarantees length generalization, in an idealized setting and with an idealized learning algorithm. We prove such guarantees for several classes of ground-truth functions that one may want a model to learn. Our main results concern ground-truth functions expressed as C-RASP programs, a theoretical abstraction of the transformer, the architecture underlying LLMs. For a subclass of C-RASP, we derive mathematical expressions for the smallest training-data length that guarantees length generalization.
Primary Area: Deep Learning->Large Language Models
Keywords: Large Language Model, Length Generalization, Machine Learning Theory
Submission Number: 9132