Retrieval-Augmented Generation as In-Context Optimization: A Gradient Descent Perspective

ICLR 2026 Conference Submission 21174 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: retrieval-augmented generation (RAG), in-context learning, gradient descent
Abstract: It remains unclear whether in-context learning (ICL) can serve as an alternative mechanism for retrieval-augmented generation (RAG), and its underlying operation is still poorly understood and largely characterized only intuitively. In this paper, we propose that trained Transformers can be viewed as performing retrieval-augmented generation through gradient descent. We begin by proving a weight construction that shows the equivalence of the data transformations induced by a linear self-attention-based Transformer and by RAG training on a regression loss. Motivated by this construction, we empirically demonstrate that, when trained on simple regression tasks, self-attention-only Transformers behave strikingly similarly to RAG models trained via gradient descent. This allows us, at least within the scope of regression problems, to gain a mechanistic understanding of how in-context learning can be leveraged to optimize RAG. Moreover, we observe that the data distribution critically affects the generalizability of the learned models in the non-linear setting, so we propose strategies to enhance the robustness of ICL against the distributional variability encountered in practice. Among these, we explore normalization techniques as one representative approach, showing that they can effectively improve both training stability and cross-domain generalization.
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 21174