Keywords: Attention Mechanisms
Abstract: Softmax-based dot-product attention is a cornerstone of Transformer architectures, enabling remarkable capabilities such as in-context learning. However, as context lengths increase, a fundamental limitation of the softmax function emerges: it tends to diffuse probability mass, assigning non-trivial weights to irrelevant tokens. This dilutes focus and degrades precision, especially in long-sequence scenarios. We introduce \textit{LUCID Attention}, an architectural modification that applies a preconditioner to the attention probabilities. This preconditioner, derived from exponentiated key-key similarities, minimizes overlap between the keys in a Reproducing Kernel Hilbert Space, allowing the query to focus accurately on important keys among a large number of keys. If a query $\mathbf{q}$ is highly similar to a key $\mathbf{k}$, LUCID outputs the corresponding value vector $\mathbf{v}$ with minimal blending from other tokens. This mechanism enables significantly sharper and more precise attention distributions. LUCID is designed as a drop-in replacement for existing attention mechanisms, retaining the same asymptotic complexity. We validate our approach by training $\sim$1-billion-parameter language models, pre-trained at a 2K sequence length and then fine-tuned up to a 65K sequence length. Our results demonstrate improved next-token prediction loss and significant gains on long-context retrieval tasks. LUCID shows an average improvement of $\sim$20\% on single- and multi-needle-in-a-haystack benchmarks compared to standard attention.
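The abstract does not spell out the exact preconditioner, so the following is only a minimal NumPy sketch of one plausible instantiation: it assumes the preconditioner is the inverse of the (row-normalized) exponentiated key-key Gram matrix, applied to the standard softmax probabilities, and the names `lucid_attention_sketch`, the $1/\sqrt{d}$ scaling, and the `eps` regularizer are illustrative assumptions rather than the authors' specification.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def lucid_attention_sketch(Q, K, V, eps=1e-4):
    """Hypothetical LUCID-style attention.

    Q: (n_q, d) queries; K, V: (n_k, d) keys and values.
    Returns an (n_q, d) array of attention outputs.
    """
    d = Q.shape[-1]
    scale = 1.0 / np.sqrt(d)
    # Standard attention probabilities.
    P = softmax(Q @ K.T * scale, axis=-1)          # (n_q, n_k)
    # Assumed preconditioner source: exponentiated key-key similarities,
    # row-normalized and regularized for a stable inverse.
    G = softmax(K @ K.T * scale, axis=-1)          # (n_k, n_k)
    G = G + eps * np.eye(K.shape[0])
    # Precondition the probabilities: a query that matches key k_i yields a
    # row of P close to row i of G, so P @ G^{-1} is close to the one-hot e_i
    # and the output is close to v_i (minimal blending from other tokens).
    A = np.linalg.solve(G.T, P.T).T                # equals P @ inv(G)
    return A @ V
```

Under these assumptions the mechanism keeps the same asymptotic cost profile as standard attention apart from the key-key Gram computation and solve, which is consistent with the abstract's claim of a drop-in replacement; the actual LUCID formulation may differ in normalization and regularization details.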
Primary Area: foundation or frontier models, including LLMs
Submission Number: 22196