ELSAA: Efficient Low-Rank and Sparse Attention Approximation for Training Transformers

08 May 2026 (modified: 09 May 2026) · ICML 2026 Workshop CoLoRAI Submission · CC BY 4.0
Keywords: Linear Attention, Sparse + Low-Rank Attention
Abstract: The quadratic N × N attention score matrix remains a central obstacle to extending Transformers to longer input lengths. Existing efficient attention methods typically reduce this bottleneck either by imposing sparsity, so that each query attends to only a small subset of keys, or by using low-rank/kernel sketches, so that global interactions are compressed into a lower-dimensional representation. We propose ELSAA, an efficient low-rank and sparse approximation of attention. Importantly, ELSAA does not decompose the learned projection or output matrices of the Transformer into sparse and low-rank factors. Instead, after dense projections produce Q, K, V, ELSAA approximates the induced attention score operator itself: a sparse branch captures selected high-similarity interactions, while a low-rank branch summarizes diffuse global interactions. Because the two branches are normalized over supports with very different denominator mass, ELSAA introduces a denominator-aware fusion term that scales the sparse branch according to its estimated attention mass relative to the low-rank branch. This yields a practical framework for constructing low-rank and sparse attention outputs without materializing the full quadratic score matrix, aiming to enable longer-context training while preserving both sharp token-level interactions and broad contextual mixing.
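To make the two-branch construction concrete, below is a minimal NumPy sketch, not taken from the paper: it assumes exp-score top-k selection for the sparse branch, an elu(x)+1 feature map for the low-rank branch, and a fusion that sums the per-branch softmax numerators and denominators as one plausible reading of "denominator-aware" weighting. The names `elsaa_attention_sketch`, `top_k`, and `phi` are illustrative, and for simplicity the sketch materializes the full score matrix, which ELSAA itself aims to avoid.

```python
import numpy as np

def elsaa_attention_sketch(Q, K, V, top_k=8):
    """Minimal sketch of a sparse + low-rank attention approximation.

    Sparse branch: each query attends only to its top_k highest-scoring keys
    with standard exp(q.k / sqrt(d)) weights.
    Low-rank branch: kernelized linear attention with an elu(x)+1 feature map,
    so global mixing costs O(N*d*d) instead of O(N^2*d).
    Fusion: per-branch numerators and denominators are summed, so the sparse
    branch is implicitly weighted by its attention mass relative to the
    low-rank branch (an assumed, hedged interpretation of the fusion term).
    """
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)

    # ---- sparse branch: exact scores restricted to a top-k support per query ----
    scores = (Q @ K.T) * scale                                # (N, N); a real implementation
    idx = np.argpartition(-scores, top_k - 1, 1)[:, :top_k]   # would avoid this full matrix
    rows = np.arange(N)[:, None]
    w_sparse = np.exp(scores[rows, idx])                      # (N, top_k) unnormalized weights
    num_sparse = np.einsum('nk,nkd->nd', w_sparse, V[idx])    # sparse numerator
    den_sparse = w_sparse.sum(-1, keepdims=True)              # sparse attention mass

    # ---- low-rank branch: kernelized linear attention ----
    def phi(X):                                   # positive feature map (assumption)
        return np.where(X > 0, X + 1.0, np.exp(X))  # elu(x) + 1
    Qf, Kf = phi(Q * scale), phi(K)
    num_lr = Qf @ (Kf.T @ V)                      # (N, d) global numerator
    den_lr = Qf @ Kf.sum(0, keepdims=True).T      # (N, 1) global attention mass

    # ---- denominator-aware fusion: share one normalizer across both branches ----
    return (num_sparse + num_lr) / (den_sparse + den_lr + 1e-6)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((128, 32)) for _ in range(3))
    out = elsaa_attention_sketch(Q, K, V, top_k=8)
    print(out.shape)  # (128, 32)
```

Sharing a single normalizer across branches is the simplest way to keep the two sets of unnormalized weights on a common scale; the paper's actual fusion term may instead learn or estimate this relative mass explicitly.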
Submission Number: 128