FLARE: Fast Low-rank Attention Routing Engine

TMLR Paper8716 Authors

01 May 2026 (modified: 24 May 2026) | Under review for TMLR | CC BY 4.0
Abstract: The quadratic complexity of self-attention limits the scalability of transformers on long sequences. We introduce **Fast Low-rank Attention Routing Engine (FLARE)**, a token-mixing operator that realizes low-rank attention by routing information through a small set of latent tokens. Each layer induces an input-input token mixing matrix of rank at most $M$ via a minimal encode-decode factorization implemented using only two standard scaled dot-product attention (SDPA) calls. Because the dominant $\mathcal{O}(NM)$ computation is expressed purely in terms of standard SDPA, FLARE is compatible with fused attention kernels and avoids materializing $M\times N$ projection matrices. FLARE further assigns disjoint latent slices to each attention head, yielding a mixture of head-specific low-rank pathways. Empirically, FLARE scales to **one-million-point unstructured meshes on a single GPU**, delivers strong results across PDE surrogate benchmarks, and performs competitively on the Long Range Arena suite. We additionally release a large-scale additive manufacturing benchmark dataset.
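The encode-decode factorization described in the abstract can be illustrated with a minimal single-head NumPy sketch. The abstract does not spell out the exact formulation, so the details here are assumptions: the learned latent queries `latent_q` (shape $M \times D$), the reuse of the keys as decode queries, and the helper names `attn_weights`, `sdpa`, and `latent_routing` are hypothetical and chosen only for illustration. The sketch shows two SDPA calls costing $\mathcal{O}(NM)$ and checks that the implied input-input mixing matrix has rank at most $M$.

```python
import numpy as np

def attn_weights(q, k):
    """Row-stochastic attention weights softmax(q k^T / sqrt(d))."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

def sdpa(q, k, v):
    """Standard scaled dot-product attention."""
    return attn_weights(q, k) @ v

def latent_routing(latent_q, k, v):
    """Assumed encode-decode routing through M latent tokens (single head)."""
    z = sdpa(latent_q, k, v)      # encode: N inputs -> M latents, shape (M, D)
    return sdpa(k, latent_q, z)   # decode: M latents -> N outputs, shape (N, D)

rng = np.random.default_rng(0)
N, M, D = 64, 4, 8                      # illustrative sizes, not from the paper
k = rng.normal(size=(N, D))
v = rng.normal(size=(N, D))
latent_q = rng.normal(size=(M, D))      # stands in for learned latent queries

y = latent_routing(latent_q, k, v)

# Explicit N x N mixing matrix, materialized here only to check its rank;
# the two-call form above never builds it.
mixing = attn_weights(k, latent_q) @ attn_weights(latent_q, k)
rank = np.linalg.matrix_rank(mixing)
```

Because `mixing` is the product of an $N \times M$ and an $M \times N$ matrix, its rank cannot exceed $M$, and `y` equals `mixing @ v` up to floating-point error. In a multi-head setting, each head would use its own disjoint slice of latent tokens, which this single-head sketch does not show.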
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Tae-Hyun_Oh3
Submission Number: 8716