Keywords: transformer, induction head, associative memory, positional encoding
TL;DR: This paper investigates, from the viewpoint of associative memory, how a two-layer transformer captures in-context information and balances it with pretrained bigram knowledge in next token prediction.
Abstract: The induction head mechanism is part of the computational circuits for in-context learning (ICL) that enable large language models (LLMs) to adapt to new tasks without fine-tuning. Most existing work explains the training dynamics behind acquiring such a powerful mechanism. However, it remains unclear how a transformer extracts information from long contexts and then coordinates it with global knowledge acquired during pretraining. This paper treats weight matrices as associative memories to investigate how an induction head operates over long contexts and balances in-context and global bigram knowledge in next token prediction. We theoretically analyze the representation of the learned associative memories in the attention layers and the resulting logits when the transformer is given prompts generated by a bigram model. In the experiments, we design specific prompts to evaluate whether the outputs of the trained transformer align with the theoretical results.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Submission Number: 1617