Sublinear Time Quantum Algorithm for Attention Approximation

ICLR 2026 Conference Submission14362 Authors

18 Sept 2025 (modified: 08 Oct 2025) · CC BY 4.0
Keywords: quantum computing, attention approximation, numerical linear algebra
TL;DR: We develop the first quantum algorithm that approximates the attention matrix and outputs any of its rows in time sublinear in the context length.
Abstract: Given the query, key, and value matrices $Q, K, V\in \mathbb{R}^{n\times d}$, the attention matrix is defined as $\mathrm{Att}(Q, K, V)=D^{-1}AV$, where $A=\exp(QK^\top/\sqrt{d})$ with $\exp(\cdot)$ applied entrywise and $D=\mathrm{diag}(A{\bf 1}_n)$. The attention matrix is the backbone of modern transformers and large language models, but explicitly forming the softmax matrix $D^{-1}A$ incurs $\Omega(n^2)$ time, motivating numerous approximation schemes that reduce the runtime to $\widetilde O(nd)$ via sparsity or low-rank factorization. We propose a quantum data structure that approximates any row of $\mathrm{Att}(Q, K, V)$ using only row queries to $Q, K, V$. Our algorithm preprocesses these matrices in $\widetilde{O}\left( \epsilon^{-1} n^{0.5} \left( s_\lambda^{2.5} + s_\lambda^{1.5} d + \alpha^{0.5} d \right) \right)$ time, where $\epsilon$ is the target accuracy, $s_\lambda$ is the $\lambda$-statistical dimension of the exponential kernel defined by $Q$ and $K$, and $\alpha$ measures the row distortion of $V$. Each row query can be answered in $\widetilde{O}(s_\lambda^2 + s_\lambda d)$ time. To our knowledge, this is the first quantum data structure that approximates rows of the attention matrix in time sublinear in $n$. Our approach relies on a quantum Nystr{\"o}m approximation of the exponential kernel, quantum multivariate mean estimation for computing $D$, and quantum leverage score sampling for the multiplication with $V$.
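For concreteness, the following is a minimal NumPy sketch of the exact attention definition above; the function name exact_attention and all variable names are illustrative and not taken from the submission. It makes explicit the $\Omega(n^2)$ cost of forming $A$ that the proposed sublinear-time quantum row queries avoid.

```python
# Minimal sketch of the exact definition Att(Q, K, V) = D^{-1} A V,
# with A = exp(Q K^T / sqrt(d)) entrywise and D = diag(A 1_n).
# This is the Omega(n^2) baseline; names here are illustrative only.
import numpy as np

def exact_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    n, d = Q.shape
    A = np.exp(Q @ K.T / np.sqrt(d))             # n x n exponential kernel matrix
    D_inv = 1.0 / A.sum(axis=1, keepdims=True)   # D^{-1}: reciprocal row sums of A
    return D_inv * (A @ V)                       # softmax(Q K^T / sqrt(d)) @ V

# Even a single output row costs Theta(n d) to form exactly, which is what
# the sublinear-in-n row-query data structure is designed to circumvent.
rng = np.random.default_rng(0)
n, d = 8, 4
Q, K, V = rng.standard_normal((3, n, d))
row0 = exact_attention(Q, K, V)[0]
```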
Primary Area: optimization
Submission Number: 14362