Global Optimality of In-context Markovian Dynamics Learning

27 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: transformers, in-context learning, Markov Chains, next token prediction
TL;DR: This study explores how transformers learn in-context next-token prediction for Markov chains, revealing that the global optimum adapts to the Markovian dynamics; empirical validation supports our findings.
Abstract: Transformers have demonstrated an impressive capability for in-context learning (ICL): given a sequence of input-output pairs of an unseen task, a trained transformer can make reasonable predictions on query inputs without fine-tuning its parameters. However, existing studies on ICL have mainly focused on linear regression tasks, often with i.i.d. inputs within a prompt. This paper seeks to unveil the mechanism of ICL for next-token prediction on Markov chains, focusing on the transformer architecture with linear self-attention (LSA). More specifically, we derive and interpret the global optimum of the ICL loss landscape: (1) We provide a closed-form expression for the global minimizer of single-layer LSA trained over random instances of length-2 in-context Markov chains, showing that the Markovian data distribution necessitates a denser global minimum structure than ICL for linear tasks. (2) We establish tight bounds on the global minimum of single-layer LSA trained on arbitrary-length Markov chains. (3) Finally, we prove that multilayer LSA, with a parameterization mirroring the global minimizer's structure, performs preconditioned gradient descent for a multi-objective optimization problem over the in-context samples, balancing a squared loss with multiple linear objectives. We numerically explore ICL for Markov chains using both simplified transformers and GPT-2-based multilayer nonlinear transformers.
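To make the setting concrete, below is a minimal sketch of the kind of task the abstract describes: sampling an in-context prompt from a random Markov chain and running a single-layer linear self-attention (softmax-free) forward pass for next-token prediction. The embedding scheme, prompt format, and LSA parameterization here are illustrative assumptions for exposition, not the paper's exact construction.

```python
# Illustrative sketch only: one-hot token embeddings and the LSA layer below
# are assumptions; the paper's precise prompt encoding and parameterization differ.
import numpy as np

rng = np.random.default_rng(0)

def sample_markov_prompt(num_states=3, length=32):
    """Sample a random transition matrix and a token sequence from it."""
    P = rng.dirichlet(np.ones(num_states), size=num_states)  # rows sum to 1
    tokens = [rng.integers(num_states)]
    for _ in range(length - 1):
        tokens.append(rng.choice(num_states, p=P[tokens[-1]]))
    return P, np.array(tokens)

def one_hot(tokens, num_states):
    return np.eye(num_states)[tokens]

def lsa_forward(Z, WK, WQ, WV):
    """Single-layer linear self-attention: no softmax, attention scores are
    raw inner products normalized by the context length."""
    K, Q, V = Z @ WK.T, Z @ WQ.T, Z @ WV.T
    attn = (Q @ K.T) / Z.shape[0]          # (T, T) linear attention scores
    return Z + attn @ V                    # residual + attention output

num_states, length = 3, 32
P, tokens = sample_markov_prompt(num_states, length)
Z = one_hot(tokens, num_states)            # prompt embedding: one-hot tokens

d = num_states
WK, WQ, WV = (0.1 * rng.standard_normal((d, d)) for _ in range(3))
out = lsa_forward(Z, WK, WQ, WV)

# The output at the last position serves as (unnormalized) scores for the next
# token; ICL training would fit WK, WQ, WV over many random chains so that
# these scores track the true conditional distribution P[tokens[-1]].
print("true next-token distribution:", P[tokens[-1]])
print("LSA output scores at last position:", out[-1])
```

In this toy setup, the learner never sees the transition matrix P directly; it must infer the next-token statistics from the prompt itself, which is the in-context prediction problem whose global optimum the paper characterizes.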
Primary Area: learning theory
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9594