TL;DR: A transformer can simulate gradient-descent training of a deep ReLU network via in-context learning, with provable approximation and convergence guarantees.
Abstract: We investigate the capability of transformers to simulate the training process of deep models via in-context learning (ICL).
Our key contribution is a positive example: a transformer that implicitly trains a deep neural network by gradient descent through ICL.
Specifically, we provide an explicit construction of a $(2N+4)L$-layer transformer capable of simulating $L$ gradient descent steps of an $N$-layer ReLU network through ICL.
We also give theoretical guarantees that the approximation holds within any given error, as well as for the convergence of the ICL-based gradient descent.
Additionally, we extend our analysis to the more practical setting of Softmax-based transformers.
We validate our findings on synthetic datasets for 3-layer, 4-layer, and 6-layer neural networks.
The results show that ICL performance matches that of direct training.
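To make the setup concrete, below is a minimal sketch, not the paper's construction, of the explicit procedure being emulated: $L$ gradient descent steps on an $N$-layer ReLU network. The network shapes, squared loss, learning rate, and synthetic data are illustrative assumptions.

```python
# A minimal sketch (not the paper's construction): the explicit training loop that
# the (2N+4)L-layer transformer is claimed to reproduce in-context, i.e. L gradient
# descent steps of an N-layer ReLU network. Shapes, the squared loss, the learning
# rate, and the synthetic data below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def init_relu_net(layer_dims):
    """Random weights for an N-layer ReLU network (N = len(layer_dims) - 1)."""
    return [rng.normal(scale=0.1, size=(d_in, d_out))
            for d_in, d_out in zip(layer_dims[:-1], layer_dims[1:])]

def forward(weights, x):
    """Forward pass; keeps every layer's activation for backpropagation."""
    acts = [x]
    for i, W in enumerate(weights):
        z = acts[-1] @ W
        acts.append(np.maximum(z, 0.0) if i < len(weights) - 1 else z)
    return acts

def gd_step(weights, x, y, lr):
    """One explicit gradient descent step under squared loss."""
    acts = forward(weights, x)
    delta = 2.0 * (acts[-1] - y) / len(x)                   # d(loss)/d(output)
    grads = []
    for i in reversed(range(len(weights))):
        grads.append(acts[i].T @ delta)                     # d(loss)/d(W_i)
        if i > 0:
            delta = (delta @ weights[i].T) * (acts[i] > 0)  # ReLU derivative
    grads.reverse()
    return [W - lr * g for W, g in zip(weights, grads)]

# L = 20 gradient descent steps of a 3-layer ReLU network on a toy regression task.
X, Y = rng.normal(size=(64, 8)), rng.normal(size=(64, 1))
weights = init_relu_net([8, 16, 16, 1])
for _ in range(20):
    weights = gd_step(weights, X, Y, lr=0.05)
print("squared loss after 20 steps:", np.mean((forward(weights, X)[-1] - Y) ** 2))
```

In the ICL setting studied here, the transformer instead receives the example data in its prompt, and the analogue of these weight updates happens implicitly inside its forward pass, without any change to the transformer's own parameters.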
Lay Summary: Training deep neural networks from scratch is expensive and time-consuming. We asked: can a powerful pretrained transformer simulate the training of another deep model without updating its own parameters? This question matters because it could make machine learning far more efficient and accessible.
We show that transformers can perform such “in-context learning,” effectively simulating multiple steps of gradient descent by just observing example data. We construct a transformer architecture that replicates the training process of deep neural networks and provide theoretical guarantees on its accuracy and convergence. We also validate our results through experiments.
Our results suggest a path where one foundation model can enable the training of many others, reducing redundancy and computational cost.
Primary Area: Deep Learning->Foundation Models
Keywords: foundation model, transformer, in-context learning
Submission Number: 6952