Transformers as Stochastic Optimizers

Published: 18 Jun 2024 · Last Modified: 18 Jun 2024 · ICML 2024 Workshop ICL Poster · CC BY 4.0
Track: short paper (up to 4 pages)
Keywords: SGD, Adam
TL;DR: We introduce in-context stochastic algorithms
Abstract: In-context learning is a crucial framework for understanding the learning processes of foundation models, and Transformers are the architecture most frequently studied in this setting. Recent experiments have demonstrated that Transformers can learn algorithms such as gradient descent from data. From a theoretical perspective, however, while Transformers have been shown to approximate non-stochastic algorithms, no such result exists for stochastic algorithms such as stochastic gradient descent. This study develops a theory of how Transformers represent stochastic algorithms in in-context learning. Specifically, we show that Transformers can generate truly random numbers by extracting the randomness inherent in the data, and pseudo-random numbers by implementing pseudo-random number generators. As a direct application, we demonstrate that Transformers can implement stochastic optimizers, including stochastic gradient descent and Adam, in context.
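
For readers unfamiliar with the algorithms named in the abstract, the sketch below spells out the ingredients a Transformer would need to emulate in context: a pseudo-random number generator (here a linear congruential generator, chosen only for illustration) driving minibatch sampling, and the SGD and Adam update rules. This is a minimal standalone illustration, not the paper's Transformer construction; all parameter values are hypothetical choices.

```python
import numpy as np

def lcg(seed, m=2**31 - 1, a=1103515245, c=12345):
    """Linear congruential pseudo-random number generator (illustrative parameters)."""
    state = seed
    while True:
        state = (a * state + c) % m
        yield state / m  # uniform value in [0, 1)

def sgd_step(w, grad, lr=0.1):
    """Plain stochastic gradient descent update."""
    return w - lr * grad

def adam_step(w, grad, m, v, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """Adam update with bias-corrected first and second moment estimates."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad**2
    m_hat = m / (1 - b1**t)
    v_hat = v / (1 - b2**t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Toy least-squares problem: minimize ||Xw - y||^2 with PRNG-driven sample selection.
rng = lcg(seed=42)
X = np.random.randn(100, 5)
w_true = np.arange(5, dtype=float)
y = X @ w_true

w = np.zeros(5)
m_est, v_est = np.zeros(5), np.zeros(5)
for t in range(1, 2001):
    i = int(next(rng) * len(X))          # pseudo-random sample index
    grad = 2 * (X[i] @ w - y[i]) * X[i]  # gradient of the squared error on that sample
    # Swap in sgd_step(w, grad) here for the plain SGD variant.
    w, m_est, v_est = adam_step(w, grad, m_est, v_est, t)

print("recovered w:", np.round(w, 2))
```

The point of the sketch is that both the sampling randomness and the optimizer updates are simple, fixed computations; the paper's contribution is showing that Transformers can realize such computations in context.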
Submission Number: 33