Provable Length Generalization in Sequence Prediction via Spectral Filtering

Annie Marsden; Evan Dogariu; Naman Agarwal; Xinyi Chen; Daniel Suo; Elad Hazan

Published: 01 May 2025, Last Modified: 18 Jun 2025, ICML 2025 poster, CC BY 4.0

TL;DR: We define a measure of length generalization and provide provable guarantees based on this notion for the spectral filtering algorithm.

Abstract: We consider the problem of length generalization in sequence prediction. We define a new metric of performance in this setting, the Asymmetric-Regret, which measures regret against a benchmark predictor with a longer context length than is available to the learner. We then study this concept through the lens of the spectral filtering algorithm. We present a gradient-based learning algorithm that provably achieves length generalization for linear dynamical systems. We conclude with proof-of-concept experiments that are consistent with our theory.
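
As a reading aid, the verbal definition of Asymmetric-Regret above can be written schematically. The notation here is an assumption and does not come from the paper: L is the learner's context length, L' > L is the benchmark's context length, \hat{y}_t^{(L)} is the learner's prediction at time t, \Pi_{L'} is a class of benchmark predictors that see the last L' inputs, and \ell is a per-step loss. The paper's formal definition may differ in its details.

```latex
\[
\text{Asymmetric-Regret}_T(L, L')
  \;=\; \sum_{t=1}^{T} \ell\bigl(\hat{y}_t^{(L)},\, y_t\bigr)
  \;-\; \min_{\pi \in \Pi_{L'}} \sum_{t=1}^{T} \ell\bigl(\pi(u_{t-L'+1:t}),\, y_t\bigr),
  \qquad L < L'.
\]
```

Under this reading, the comparison is asymmetric because the learner is restricted to the shorter context L while the benchmark it competes against uses the longer context L'.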

Lay Summary: Many modern machine learning problems involve making predictions using a history of past observations (text generation from an LLM, weather prediction, etc.). A desirable property for an algorithm is the ability to effectively use longer histories than those that are seen in the training data, which is referred to as “length generalization”. We investigate this question theoretically through the lens of a powerful framework for describing algorithmic learning, known as online learning. In our paper, we first introduce a notion – the Asymmetric-Regret – to concretely and mathematically define what it means for an algorithm to exhibit length generalization. We then look at the linear dynamical system (LDS): a simple but general problem which (1) allows provable learning algorithms and (2) teaches us about algorithmic properties that hold on more complicated problems. For our main result, we prove that a particular method for learning an LDS, which is known as spectral filtering and is currently being used as a layer in modern deep neural networks for sequential data, length generalizes without modification. This work provides a natural language to describe length generalization of learning algorithms. Furthermore, the length generalization of spectral filtering is surprising and indicative of its usefulness in neural networks.
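
To make the objects in the summary concrete, here is a minimal Python sketch of a small linear dynamical system and a set of spectral filters applied to its inputs. It rests on several assumptions that are not stated on this page: the filters are taken as the top eigenvectors of the Hankel matrix with entries 2 / ((i + j)^3 - (i + j)), the construction used in earlier spectral filtering work; the system matrices, context length, and filter count are arbitrary illustrative values; and a batch least-squares fit stands in for the paper's gradient-based online algorithm. All function names are hypothetical.

```python
import numpy as np

def simulate_lds(A, B, C, inputs):
    """Roll out x_t = A x_{t-1} + B u_t, y_t = C x_t with scalar inputs and outputs."""
    x = np.zeros(A.shape[0])
    outputs = []
    for u in inputs:
        x = A @ x + B * u
        outputs.append(C @ x)
    return np.array(outputs)

def spectral_filters(context_len, num_filters):
    """Top eigenpairs of the Hankel matrix Z[i, j] = 2 / ((i + j)^3 - (i + j)), i, j >= 1.

    Assumption: this matrix comes from prior spectral filtering work, not from this page.
    """
    idx = np.arange(1, context_len + 1)
    s = idx[:, None] + idx[None, :]
    Z = 2.0 / (s**3 - s)
    eigvals, eigvecs = np.linalg.eigh(Z)  # eigenvalues in ascending order
    top_vals = eigvals[-num_filters:][::-1]
    top_vecs = eigvecs[:, -num_filters:][:, ::-1]
    return top_vals, top_vecs

def filter_features(inputs, filters):
    """Project the most recent context_len inputs onto each filter at every step."""
    context_len, num_filters = filters.shape
    padded = np.concatenate([np.zeros(context_len), inputs])  # zeros before time 0
    feats = np.zeros((len(inputs), num_filters))
    for t in range(len(inputs)):
        # window holds u_t, u_{t-1}, ..., u_{t-context_len+1}
        window = padded[t + 1 : t + context_len + 1][::-1]
        feats[t] = window @ filters
    return feats

# Illustrative system with a symmetric (diagonal) transition matrix; values are arbitrary.
A = np.diag([0.9, 0.5])
B = np.array([1.0, 1.0])
C = np.array([1.0, -0.5])

rng = np.random.default_rng(0)
inputs = rng.standard_normal(200)
outputs = simulate_lds(A, B, C, inputs)

eigvals, filters = spectral_filters(context_len=32, num_filters=4)
feats = filter_features(inputs, filters)

# Batch least squares as a stand-in for the paper's online gradient-based learner.
weights, *_ = np.linalg.lstsq(feats, outputs, rcond=None)
mse = float(np.mean((feats @ weights - outputs) ** 2))
print("mean squared error of the filtered linear fit:", mse)
```

The sketch only shows how a fixed-length input window is turned into a few filtered features. It does not reproduce the paper's guarantee, which concerns how a learner trained with a short context compares against a benchmark that uses a longer one.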

Primary Area: General Machine Learning->Sequential, Network, and Time Series Modeling

Keywords: sequence prediction; length generalization

Submission Number: 12021
