Transformers Learn Higher-Order Optimization Methods for In-Context Learning: A Study with Linear Models

Deqing Fu; Tianqi CHEN; Robin Jia; Vatsal Sharan

Transformers Learn Higher-Order Optimization Methods for In-Context Learning: A Study with Linear Models

Deqing Fu, Tianqi CHEN, Robin Jia, Vatsal Sharan

21 Sept 2023 (modified: 11 Feb 2024)Submitted to ICLR 2024EveryoneRevisionsBibTeX

Primary Area: visualization or interpretation of learned representations

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Keywords: in-context learning, transformers, linear regression

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.

TL;DR: a transformer model trained on linear regression learns to implement an algorithm more similar to higher-order optimization methods, than to gradient descent.

Abstract: Transformers are remarkably good at *in-context learning* (ICL)---learning from demonstrations without parameter updates---but how they perform ICL remains a mystery. Recent work suggests that Transformers may learn in-context by internally running Gradient Descent, a first-order optimization method. In this paper, we instead demonstrate that Transformers learn to implement higher-order optimization methods to perform ICL. Focusing on in-context linear regression, we show that Transformers learn to implement an algorithm very similar to *Iterative Newton's Method*, a higher-order optimization method, rather than Gradient Descent. Empirically, we show that predictions from successive Transformer layers closely match different iterations of Newton's Method, with each middle layer roughly computing 3 iterations; Gradient Descent is a much poorer match for the Transformer. We also show that Transformers can learn in-context on ill-conditioned data, a setting where Gradient Descent struggles but Iterative Newton succeeds. Finally, we show theoretical results which support our empirical findings and have a close correspondence with them: we prove that Transformers can implement $k$ iterations of Newton's method with $\mathcal{O}(k)$ layers.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 2944

Loading