Keywords: In-Context Learning, Linear Attention, Large Language Models, Instruction-Tuning
TL;DR: We show that ICL on a single layer of linearized attention is equivalent to one step of gradient descent on a specific dataset. We then extend this result to multiple layers, where ICL resembles a greedy layer-wise training algorithm.
Abstract: In-context learning (ICL) is a powerful capability of large language models that has emerged in recent years. Despite its impact, the exact mechanism behind ICL is still only understood to a limited extent. In this paper, we show that ICL on a single linearized self-attention layer is equivalent to a single step of gradient descent on a specific dataset. This property holds without the additional assumptions on the model parameters that are required by other work in the field. We then extend our setting to a more realistic multi-layer framework and observe that in-context learning resembles a greedy layer-wise algorithm for updating the weights of a multi-layer language model. Finally, we extend our theoretical conclusions to the autoregressive setting, whereas many other works comparing ICL to gradient descent are restricted to specific settings without a causal mask.
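To make the single-layer claim concrete, here is a minimal numerical sketch of the well-known equivalence the abstract refers to, not the paper's exact construction: with identity key/query/value maps and learning rate 1, the prediction of a linearized (softmax-free) attention layer on a query token coincides with the prediction of a linear model after one gradient-descent step on the in-context examples. The dimensions `d`, `n` and the learning rate `lr` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 16                       # feature dimension, number of context pairs
X = rng.normal(size=(n, d))       # in-context inputs x_i
y = X @ rng.normal(size=d)        # in-context targets y_i
x_q = rng.normal(size=d)          # query input

# Linearized attention (no softmax): keys = x_i, values = y_i, query = x_q,
# so the output is sum_i y_i <x_i, x_q>.
attn_pred = (y * (X @ x_q)).sum()

# One GD step on L(w) = 1/2 * sum_i (w^T x_i - y_i)^2, starting from w = 0.
lr = 1.0
w0 = np.zeros(d)
grad = X.T @ (X @ w0 - y)          # gradient of L at w0, i.e. -sum_i y_i x_i
w1 = w0 - lr * grad                # = lr * sum_i y_i x_i
gd_pred = w1 @ x_q

print(attn_pred, gd_pred)          # identical up to floating-point error
assert np.isclose(attn_pred, gd_pred)
```

Under these assumptions the match is exact because the GD step from zero produces the weight vector sum_i y_i x_i, whose inner product with x_q is precisely the linear-attention output; the paper's contribution is establishing such an equivalence without hand-picked parameter assumptions and beyond one layer.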
Submission Number: 156