Keywords: retrieval-augmented generation (RAG), in-context learning, gradient descent
Abstract: It remains unclear whether in-context learning (ICL) can serve as an alternative mechanism for retrieval-augmented generation (RAG), and the underlying operation of ICL is still poorly understood and largely intuitive. In this paper, we propose that trained Transformers can be viewed as performing retrieval-augmented generation through gradient descent. We begin by proving a weight construction that establishes the equivalence between the data transformations induced by a linear self-attention Transformer and those induced by gradient-descent training of a RAG model on a regression loss. Motivated by this construction, we empirically demonstrate that, when trained on simple regression tasks, self-attention-only Transformers behave very similarly to RAG models trained via gradient descent. This allows us, at least within the scope of regression problems, to gain a mechanistic understanding of how in-context learning can be leveraged to optimize RAG. Moreover, we observe that the data distribution critically affects the generalizability of the learned models in the non-linear setting, so we propose strategies to enhance the robustness of ICL against the distributional variability encountered in practice. Among these, we explore normalization techniques as one representative approach, showing that they can effectively improve both training stability and cross-domain generalization.
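The equivalence the abstract alludes to can be illustrated with a minimal sketch. The NumPy snippet below is illustrative only: the scalar-target setting, the zero weight initialization, the learning rate eta, and all variable names are our assumptions, not the paper's exact construction. It shows the well-known sense in which a single linear self-attention layer (keys and queries taken from the inputs, values from the targets) computes the same prediction as one gradient-descent step on an in-context least-squares loss.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 5, 32       # input dimension, number of in-context examples
eta = 0.1          # learning rate of the single gradient-descent step

# In-context regression data: y_i = w* . x_i
w_star = rng.normal(size=d)
X = rng.normal(size=(N, d))   # context inputs x_1..x_N
y = X @ w_star                # context targets y_1..y_N
x_test = rng.normal(size=d)   # query input

# One GD step on L(w) = 1/(2N) * sum_i (w . x_i - y_i)^2, starting from w0 = 0:
# w1 = w0 - eta * grad L(w0) = (eta / N) * sum_i y_i x_i
w1 = eta / N * (y @ X)
pred_gd = w1 @ x_test

# Linear self-attention on the query token, with identity key/query maps on x
# and the value map reading y: output = (eta / N) * sum_i y_i * (x_i . x_test)
pred_lsa = eta / N * np.sum(y * (X @ x_test))

print(pred_gd, pred_lsa)
assert np.isclose(pred_gd, pred_lsa)  # the two data transformations coincide
```

Under these assumptions the two predictions agree exactly; the paper's construction extends this kind of correspondence to the RAG training setup on a regression loss.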
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 21174