Keywords: Attention Mechanisms
Abstract: Softmax-based dot-product attention is a cornerstone of Transformer architectures, enabling remarkable capabilities such as in-context learning. However, as context lengths increase, a fundamental limitation of the softmax function emerges: it tends to diffuse probability mass, assigning non-trivial weights to irrelevant tokens. This dilutes focus and degrades precision, especially in long-sequence scenarios. We introduce \textit{LUCID Attention}, an architectural modification that applies a preconditioner to the attention probabilities. This preconditioner, derived from exponentiated key-key similarities, minimizes overlap between the keys in a Reproducing Kernel Hilbert Space, allowing the query to focus accurately on important keys among a large number of keys. If a query $\mathbf{q}$ is highly similar to a key $\mathbf{k}$, LUCID outputs the corresponding value vector $\mathbf{v}$ with minimal blending from other tokens. This mechanism enables significantly sharper and more precise attention distributions. LUCID is designed as a drop-in replacement for existing attention mechanisms, retaining the same asymptotic complexity. We validate our approach by training $\sim$1-billion-parameter language models, pre-trained at a 2K sequence length and then fine-tuned up to a 65K sequence length. Our results demonstrate improved next-token prediction loss and significant gains on long-context retrieval tasks. LUCID shows an average improvement of $\sim$20\% on single- and multi-needle-in-a-haystack benchmarks compared to standard attention.
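The abstract does not spell out the exact preconditioner, so the following is only a minimal NumPy sketch of one plausible instantiation: it assumes the preconditioner is the inverse of the (row-normalized) exponentiated key-key Gram matrix, applied to the standard softmax probabilities, and the names `lucid_attention_sketch`, the $1/\sqrt{d}$ scaling, and the `eps` regularizer are illustrative assumptions rather than the authors' specification.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def lucid_attention_sketch(Q, K, V, eps=1e-4):
    """Hypothetical LUCID-style attention.

    Q: (n_q, d) queries; K, V: (n_k, d) keys and values.
    Returns an (n_q, d) array of attention outputs.
    """
    d = Q.shape[-1]
    scale = 1.0 / np.sqrt(d)
    # Standard attention probabilities.
    P = softmax(Q @ K.T * scale, axis=-1)          # (n_q, n_k)
    # Assumed preconditioner source: exponentiated key-key similarities,
    # row-normalized and regularized for a stable inverse.
    G = softmax(K @ K.T * scale, axis=-1)          # (n_k, n_k)
    G = G + eps * np.eye(K.shape[0])
    # Precondition the probabilities: a query that matches key k_i yields a
    # row of P close to row i of G, so P @ G^{-1} is close to the one-hot e_i
    # and the output is close to v_i (minimal blending from other tokens).
    A = np.linalg.solve(G.T, P.T).T                # equals P @ inv(G)
    return A @ V
```

Under these assumptions the mechanism keeps the same asymptotic cost profile as standard attention apart from the key-key Gram computation and solve, which is consistent with the abstract's claim of a drop-in replacement; the actual LUCID formulation may differ in normalization and regularization details.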
Primary Area: foundation or frontier models, including LLMs
Submission Number: 22196