Transformers Perform In-Context Learning through Neural Networks

23 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Keywords: In-Context Learning, Transformers, Neural Networks, Gradient Descent
Abstract: Transformer-based neural sequence models exhibit a remarkable ability to perform in-context learning: given a few training examples, a pre-trained model can make accurate predictions on a novel input. This paper studies why transformers can learn different types of function classes in context. We first show by construction that transformers implement approximate gradient descent on the parameters of neural networks, and we provide upper bounds on the number of heads, the hidden dimension, and the number of layers of the transformer. We also show that transformers can learn deep and narrow neural networks, which have better approximation capabilities than shallow and wide neural networks, using fewer resources. Our results move beyond linearity in terms of in-context learning instances and explain why transformers can learn many types of function classes through the bridge of neural networks.
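To make the abstract's central claim concrete, here is a minimal sketch of the in-context learning setup it describes and of the gradient-descent learner that the transformer's forward pass is claimed to approximate. The function class (a random two-layer ReLU network), the dimensions, the learning rate, and the step count are all illustrative assumptions, not values from the paper; the transformer construction and the stated upper bounds are not reproduced here.

```python
# Sketch only: the in-context prompt (x_i, y_i) pairs plus a query, and a
# gradient-descent baseline fit to the prompt. The paper's claim, as summarized
# in the abstract, is that a transformer's forward pass can approximate this
# kind of in-context gradient descent on neural-network parameters.
import numpy as np

rng = np.random.default_rng(0)
d, width, k = 8, 16, 32          # input dim, hidden width, # of in-context examples (assumed)

# Target function: a random two-layer ReLU network (one assumed function class).
W_star = rng.normal(size=(width, d)) / np.sqrt(d)
a_star = rng.normal(size=width) / np.sqrt(width)
f_star = lambda X: np.maximum(X @ W_star.T, 0.0) @ a_star

# In-context prompt: k labelled examples plus one query point.
X_ctx = rng.normal(size=(k, d))
y_ctx = f_star(X_ctx)
x_query = rng.normal(size=(1, d))

# Gradient-descent learner on the prompt: a few full-batch GD steps
# on a freshly initialized two-layer ReLU network under squared loss.
W = rng.normal(size=(width, d)) * 0.1
a = rng.normal(size=width) * 0.1
lr, steps = 0.05, 200            # illustrative hyperparameters
for _ in range(steps):
    H = np.maximum(X_ctx @ W.T, 0.0)   # hidden activations, shape (k, width)
    err = H @ a - y_ctx                # squared-loss residuals
    grad_a = H.T @ err / k
    grad_W = ((err[:, None] * a) * (H > 0)).T @ X_ctx / k
    a -= lr * grad_a
    W -= lr * grad_W

gd_prediction = np.maximum(x_query @ W.T, 0.0) @ a
print("GD-in-context prediction:", gd_prediction, "target:", f_star(x_query))
```

In this framing, "learning in context" means producing `gd_prediction` directly from the prompt at inference time, without any weight updates to the pre-trained model itself.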
Supplementary Material: pdf
Primary Area: general machine learning (i.e., none of the above)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7411