Keywords: Linear Attention, Sparse + Low-rank Attention
Abstract: The quadratic N × N attention score matrix remains a central obstacle to extending Transformers to longer input lengths. Existing efficient attention methods typically reduce this bottleneck either by imposing sparsity, so that each query attends to only a small subset of keys, or by using low-rank/kernel sketches, so that global interactions are compressed into a lower-dimensional representation. We propose ELSAA, an efficient low-rank and sparse approximation of attention. Importantly, ELSAA does not decompose the learned projection or output matrices of the Transformer into sparse and low-rank factors. Instead, after dense projections produce Q, K, V, ELSAA approximates the induced attention score operator itself: a sparse branch captures selected high-similarity interactions, while a low-rank branch summarizes diffuse global interactions. Since the two branches can be normalized over supports with very different denominator mass, ELSAA introduces a denominator-aware fusion term that scales the sparse branch according to its estimated attention mass relative to the low-rank branch. This yields a practical framework for constructing low-rank and sparse attention outputs without materializing the full quadratic score matrix, aiming to enable longer-context training while preserving both sharp token-level interactions and broad contextual mixing.
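To make the construction concrete, the following is a minimal NumPy sketch of one way to combine a sparse top-k branch with a kernelized low-rank branch using a denominator-aware fusion weight. The top-k selection, the ReLU-based feature map, and the exact fusion formula are illustrative assumptions rather than ELSAA's actual design, and the sketch materializes the full score matrix only for clarity, which the paper's method is described as avoiding.

```python
# Illustrative sketch of a sparse + low-rank attention approximation with a
# denominator-aware fusion term. Not the authors' ELSAA implementation: the
# top-k sparse support, the ReLU feature map, and the fusion formula below
# are assumptions made for this example.
import numpy as np

def sparse_lowrank_attention(Q, K, V, k=8):
    """Q, K: (n, d); V: (n, d_v). Returns an (n, d_v) attention output."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)

    # ---- Sparse branch: each query attends only to its top-k keys. ----
    scores = (Q @ K.T) * scale                      # full (n, n) matrix shown here
    topk = np.argpartition(-scores, k - 1, axis=1)[:, :k]   # for clarity only
    rows = np.arange(n)[:, None]
    sel_exp = np.exp(scores[rows, topk])            # exp scores on the sparse support
    denom_sparse = sel_exp.sum(axis=1, keepdims=True)        # sparse attention mass
    out_sparse = ((sel_exp / denom_sparse)[..., None] * V[topk]).sum(axis=1)

    # ---- Low-rank branch: kernelized linear attention over all keys. ----
    phi = lambda X: np.maximum(X, 0.0) + 1e-6       # simple positive feature map (assumption)
    Qf, Kf = phi(Q), phi(K)
    kv = Kf.T @ V                                   # (d, d_v) global summary
    denom_lr = Qf @ Kf.sum(axis=0, keepdims=True).T          # (n, 1) low-rank mass
    out_lr = (Qf @ kv) / denom_lr

    # ---- Denominator-aware fusion (assumed form): weight the sparse branch ----
    # by its estimated attention mass relative to the low-rank branch.
    alpha = denom_sparse / (denom_sparse + denom_lr)
    return alpha * out_sparse + (1.0 - alpha) * out_lr

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((32, 16)) for _ in range(3))
    print(sparse_lowrank_attention(Q, K, V, k=4).shape)  # (32, 16)
```

In a practical kernel, the sparse support would be found without forming the dense score matrix (e.g., via blockwise or hashing-based selection), so that only the low-rank summary and the selected entries are ever computed.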
Submission Number: 128