Optimality and NP-Hardness of Transformers in Learning Markovian Dynamical Functions

Yanna Ding; Songtao Lu; Yingdong Lu; Tomasz J Nowicki; Jianxi Gao

Published: 18 Sept 2025, Last Modified: 29 Oct 2025

Venue: NeurIPS 2025 poster

License: CC BY 4.0

Keywords: In-Context Learning, Markovian Dynamics, Transformers, Optimization

TL;DR: We study the expressive power and limitations of transformers in performing in-context learning of functions governed by Markovian dynamics.

Abstract: Transformer architectures can solve unseen tasks based on input-output pairs in a given prompt due to in-context learning (ICL). Existing theoretical studies on ICL have mainly focused on linear regression tasks, often with i.i.d. inputs. To understand how transformers express in-context learning when modeling dynamics-driven functions, we investigate Markovian function learning through a structured ICL setup, where we characterize the loss landscape to reveal underlying optimization behaviors. Specifically, we (1) provide the closed-form expression of the global minimizer (in an enlarged parameter space) for a single-layer linear self-attention (LSA) model; (2) prove that recovering transformer parameters that realize the optimal solution is NP-hard in general, revealing a fundamental limitation of one-layer LSA in representing structured dynamical functions; and (3) supply a novel interpretation of a multilayer LSA as performing preconditioned gradient descent to optimize multiple objectives beyond the square loss. These theoretical results are numerically validated using simplified transformers.
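The link between linear self-attention and gradient descent mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's construction: it uses the well-known linear-regression case from the ICL literature, in which one LSA layer with hand-set weights reproduces one plain gradient-descent step from zero on the squared loss. The prompt layout `Z`, the weight names `W_pv` and `W_kq`, the step size `eta`, and the first-order Markov chain that generates the inputs are all assumptions made for illustration.

```python
import numpy as np

def lsa_predict(Z, W_pv, W_kq):
    """One linear self-attention layer (no softmax); returns the query label slot."""
    n = Z.shape[1] - 1                      # number of in-context examples
    M = np.eye(n + 1)
    M[n, n] = 0.0                           # mask out the query column
    out = Z + W_pv @ Z @ M @ (Z.T @ W_kq @ Z) / n
    return out[-1, -1]                      # bottom-right entry holds the prediction

def gd_weights(d, eta):
    """Hand-set weights (assumed for illustration) that emulate one GD step."""
    W_pv = np.zeros((d + 1, d + 1))
    W_pv[-1, -1] = 1.0                      # read out only the label row
    W_kq = np.zeros((d + 1, d + 1))
    W_kq[:d, :d] = eta * np.eye(d)          # inner products between inputs
    return W_pv, W_kq

rng = np.random.default_rng(0)
d, n, eta = 3, 16, 0.1

# Hypothetical Markovian inputs: x_{t+1} = A x_t + noise (not the paper's setup).
A = 0.5 * np.eye(d)
X = np.zeros((d, n + 1))
X[:, 0] = rng.standard_normal(d)
for t in range(n):
    X[:, t + 1] = A @ X[:, t] + 0.1 * rng.standard_normal(d)

# Labels from a hidden linear map; the query label slot is left at zero.
w_true = rng.standard_normal(d)
y = w_true @ X[:, :n]
Z = np.vstack([X, np.append(y, 0.0)])       # shape (d + 1, n + 1)

W_pv, W_kq = gd_weights(d, eta)
pred_lsa = lsa_predict(Z, W_pv, W_kq)

# One gradient-descent step from w = 0 on (1 / 2n) * sum_i (w @ x_i - y_i)^2.
w1 = (eta / n) * X[:, :n] @ y
pred_gd = w1 @ X[:, n]

print(np.isclose(pred_lsa, pred_gd))
```

The identity is purely algebraic, so it holds for any inputs. The Markov chain here only mimics the flavor of the setting, and a preconditioned variant would replace `eta * np.eye(d)` with a general matrix.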

Primary Area: Theory (e.g., control theory, learning theory, algorithmic game theory)

Submission Number: 19215
