On learning linear dynamical systems in context with attention layers

ICLR 2026 Conference Submission 19820 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: in-context learning; linear attention; linear dynamical systems; Kalman filter; time series
TL;DR: This paper studies how linear attention layers learn linear dynamical systems in context and shows that the optimal weight construction implements one step of Gradient Descent with respect to an autoregression objective of window size one.
Abstract: This paper studies the expressive power of linear attention layers for in-context learning (ICL) of linear dynamical systems (LDS). We consider training on sequences of inexact observations produced by noise-corrupted LDSs, with all perturbations Gaussian; importantly, we study the non-i.i.d. setting, as it is closer to real-world scenarios. We provide the optimal weight construction for a single linear-attention layer and show its equivalence to one step of Gradient Descent with respect to an autoregression objective of window size one. Guided by experiments, we posit an extension to larger window sizes, and we back our findings with numerical evidence. These results add to the existing understanding of transformers' expressivity as in-context learners, and they offer plausible hypotheses for experimental observations in which they compete with Kalman filters, the optimal model-dependent learners for this setting.
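
To make the stated equivalence concrete, below is a minimal numerical sketch (not the authors' code or construction): a single linear-attention read-out with keys x_t, values x_{t+1}, and query x_T produces the same prediction as one step of Gradient Descent from zero initialization on the window-one autoregression objective L(W) = 1/2 * sum_t ||W x_t - x_{t+1}||^2. The system matrices, noise levels, learning rate eta, and token layout are illustrative assumptions only.

# Minimal sketch of the linear-attention / one-step-GD identity for LDS
# observations. All specifics (A, C, noise scales, eta) are assumed for
# illustration and are not taken from the paper.
import numpy as np

rng = np.random.default_rng(0)
d, T, eta = 3, 50, 0.1

# Simulate a noise-corrupted linear dynamical system with Gaussian
# perturbations: s_{t+1} = A s_t + w_t,  x_t = C s_t + v_t.
A = 0.9 * np.linalg.qr(rng.standard_normal((d, d)))[0]  # stable transition
C = np.eye(d)
s = rng.standard_normal(d)
xs = []
for _ in range(T + 1):
    xs.append(C @ s + 0.05 * rng.standard_normal(d))
    s = A @ s + 0.05 * rng.standard_normal(d)
xs = np.array(xs)            # observations x_0, ..., x_T
X, Y = xs[:-1], xs[1:]       # window-one pairs (x_t, x_{t+1})
x_q = xs[-1]                 # query: predict the next observation from x_T

# (1) One Gradient Descent step from W = 0 on the window-one autoregression
#     objective L(W) = 1/2 * sum_t ||W x_t - x_{t+1}||^2.
grad_at_zero = -(Y.T @ X)    # gradient of L evaluated at W = 0
W_gd = -eta * grad_at_zero   # weights after a single GD step
pred_gd = W_gd @ x_q

# (2) A single linear-attention layer (no softmax) with keys x_t,
#     values x_{t+1}, query x_q: output = eta * sum_t value_t * <key_t, query>.
pred_attn = eta * (Y.T @ (X @ x_q))

print(np.allclose(pred_gd, pred_attn))  # True: the two predictions coincide

Because there is no softmax, the attention output is a sum of value vectors weighted by key-query inner products, which is exactly the form of the one-step GD predictor eta * sum_t x_{t+1} x_t^T x_T; this algebraic identity is the mechanism behind the equivalence claimed in the abstract.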
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 19820