On learning linear dynamical systems in context with attention layers

Published: 26 Jan 2026, Last Modified: 11 Feb 2026, ICLR 2026 Poster, CC BY 4.0
Keywords: in-context learning; linear attention; linear dynamical systems; Kalman filter; time series
TL;DR: This paper studies how linear attention layers learn linear dynamical systems in context and shows that the optimal weight construction implements one step of Gradient Descent on an autoregression objective of window size one.
Abstract: This paper studies the expressive power of linear attention layers for in-context learning (ICL) of linear dynamical systems (LDS). We consider training on sequences of inexact observations produced by noise-corrupted LDSs, with all perturbations being Gaussian; importantly, we study the non-i.i.d. setting, which is closer to real-world scenarios. We provide the optimal weight construction for a single linear-attention layer and show its equivalence to one step of Gradient Descent on an autoregression objective of window size one. Guided by experiments, we uncover a relation to the Preconditioned Conjugate Gradient method for larger window sizes. We back our findings with numerical evidence. These results add to the existing understanding of transformers' expressivity as in-context learners, and offer plausible hypotheses for experimental observations in which transformers compete with Kalman filters, the optimal model-dependent learners for this setting.
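To illustrate the kind of equivalence the abstract refers to, here is a minimal NumPy sketch on a hypothetical toy LDS (not the paper's exact construction or experiment): one step of Gradient Descent from W = 0 on the window-size-one autoregression objective yields the same prediction as a linear-attention readout with keys obs[t], values obs[t+1], and query obs[T].

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: a stable LDS x_{t+1} = A x_t + process noise,
# observed with additive Gaussian noise. Dimensions and noise scales are
# illustrative assumptions, not taken from the paper.
d, T = 4, 64
A = 0.9 * np.linalg.qr(rng.normal(size=(d, d)))[0]   # transition with spectral radius 0.9
x = np.zeros((T + 1, d))
x[0] = rng.normal(size=d)
for t in range(T):
    x[t + 1] = A @ x[t] + 0.05 * rng.normal(size=d)
obs = x + 0.05 * rng.normal(size=x.shape)            # noisy observations

# One Gradient Descent step from W = 0 on the window-size-one
# autoregression objective L(W) = 1/2 * sum_t || obs[t+1] - W obs[t] ||^2,
# whose gradient at W = 0 is -sum_t outer(obs[t+1], obs[t]).
eta = 1.0 / T
W1 = eta * sum(np.outer(obs[t + 1], obs[t]) for t in range(T))
gd_pred = W1 @ obs[T]

# The same prediction written as a (softmax-free) linear-attention readout:
# keys = obs[t], values = obs[t+1], query = obs[T].
attn_pred = eta * sum(obs[t + 1] * (obs[t] @ obs[T]) for t in range(T))

print(np.allclose(gd_pred, attn_pred))   # True: the two computations coincide
```

The agreement is exact by construction, since W1 applied to the query expands into the sum of values weighted by key-query inner products; the paper's contribution concerns the optimal such weight construction for this LDS setting rather than this generic identity.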
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 19820