TL;DR: We find that decoupling the key and value in attention alleviates the depth and width requirements of induction heads, thereby improving language modeling.
Abstract: Current large language models (LLMs) predominantly rely on decoder-only transformer architectures, which exhibit exceptional in-context learning (ICL) capabilities. It is widely acknowledged that the cornerstone of their ICL ability is the induction heads mechanism, which requires at least two layers of attention. To harness the model's induction capabilities more effectively, we revisit the induction heads mechanism and prove theoretically that KV shifting attention reduces the model's dependence on the depth and width otherwise required by induction heads. Our experimental results confirm that KV shifting attention enhances the learning of induction heads and improves language modeling performance, yielding superior performance or accelerated convergence across scales, from toy models to pre-trained models with over 10 billion parameters.
Lay Summary: To better understand the in-context learning ability of LLMs, we study their key underlying mechanism, induction heads. To unlock a better induction heads mechanism, we propose applying a convolution operation to the keys and values of attention. Experimental results show that adding this temporal convolution effectively improves the model's language modeling ability.
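Since no code accompanies this page, the following is a minimal, self-contained PyTorch sketch of what a KV-shifting attention head could look like under the description above: keys and values are each mixed with the previous position's keys/values via a learnable length-2 causal (temporal) convolution before standard attention. The class name `KVShiftingAttention`, the per-head mixing parameters `k_mix`/`v_mix`, and the identity initialization are illustrative assumptions, not the paper's exact implementation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class KVShiftingAttention(nn.Module):
    """Hypothetical sketch: causal self-attention with length-2 temporal
    convolution over keys and values (current + previous token)."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)
        # Learnable per-head mixing weights for (current, previous) token;
        # initialized so the layer starts as ordinary attention (assumption).
        self.k_mix = nn.Parameter(torch.tensor([[1.0, 0.0]] * n_heads))  # (H, 2)
        self.v_mix = nn.Parameter(torch.tensor([[1.0, 0.0]] * n_heads))  # (H, 2)

    @staticmethod
    def _shift(x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, T, Dh) -> same tensor shifted right by one time step,
        # with position 0 padded by zeros (causal: no future information).
        return F.pad(x, (0, 0, 1, 0))[:, :, :-1, :]

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape

        def split(t: torch.Tensor) -> torch.Tensor:
            # (B, T, D) -> (B, H, T, Dh)
            return t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)

        q, k, v = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))

        # Length-2 causal convolution over time, separately for K and V:
        # k'_t = a1 * k_t + a2 * k_{t-1},  v'_t = b1 * v_t + b2 * v_{t-1}.
        a = self.k_mix.view(1, self.n_heads, 1, 1, 2)
        b = self.v_mix.view(1, self.n_heads, 1, 1, 2)
        k = a[..., 0] * k + a[..., 1] * self._shift(k)
        v = b[..., 0] * v + b[..., 1] * self._shift(v)

        # Standard causal scaled dot-product attention on the shifted K/V.
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        scores = scores.masked_fill(mask, float("-inf"))
        out = F.softmax(scores, dim=-1) @ v
        return self.o_proj(out.transpose(1, 2).reshape(B, T, -1))


# Example usage (shapes only; hyperparameters are arbitrary):
# attn = KVShiftingAttention(d_model=256, n_heads=4)
# y = attn(torch.randn(2, 16, 256))  # -> (2, 16, 256)
```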
Primary Area: Deep Learning->Large Language Models
Keywords: Attention mechanism, Large Language model, Induction heads
Flagged For Ethics Review: true
Submission Number: 2020