On learning linear dynamical systems in context with attention layers

Published: 26 Jan 2026, Last Modified: 20 May 2026ICLR 2026 PosterEveryoneRevisionsBibTeXCC BY 4.0
Keywords: in-context learning; linear attention; linear dynamical systems; kalman filter; time series
TL;DR: This paper studies how linear attention layers in-context learn linear dynamical systems and shows the optimal weight construction implements one step of Gradient Descent relative to an autoregression objective of window size one.
Abstract: This paper studies the expressive power of linear attention layers for in-context learning (ICL) of linear dynamical systems (LDS). We consider training on sequences of inexact observations produced by noise-corrupted LDSs, with all perturbations being Gaussian. Importantly, this non-i.i.d. data setting is a significant step towards modeling real-world scenarios. We provide the optimal weight construction for a single linear-attention layer and show its equivalence to one step of Gradient Descent relative to an autoregression objective of window size one. Guided by experiments, we uncover a connection to a generalization of the Preconditioned Conjugate Gradient method for larger window sizes. We back our findings with numerical evidence. These results add to the existing understanding of transformers' expressivity as in-context learners and offer plausible hypotheses for recent observations that place their performance on par with that of the Kalman Filter --- the optimal model-dependent learner for this setting.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 19820
Loading