Keywords: attention mechanisms, low-rank approximation, PDE surrogate modeling, scientific machine learning, point clouds, unstructured meshes, efficient transformers, additive manufacturing, 3D geometry
TL;DR: FLARE is an efficient transformer that leverages the low-rank structure of attention to balance accuracy and efficiency, scaling to extremely large token counts on a single GPU.
Abstract: The quadratic complexity of self-attention limits its applicability and scalability on long sequences.
We introduce Fast Low-rank Attention Routing Engine (FLARE),
a simplified latent-attention mechanism wherein each attention head encodes $N$ input tokens onto $M \ll N$ latent tokens and immediately deocdes back.
This produces an implicit rank $\leq M$ attention operator using only two cross-attention steps providing a simple and mathematically transparent low-rank formulation of self-attention that can be implemented in $\mathcal{O}(MN)$ time.
Crucially, FLARE eliminates latent-space self-attention and, by expressing both encode and decode steps as fused-attention operations, avoids ever materializing the $M\times N$ projection matrices.
Moreover, by assigning each head independent latent token slices, FLARE realizes multiple parallel low-rank pathways that collectively approximate a richer global attention pattern without sacrificing efficiency.
Empirically, FLARE trains end-to-end on one-million-point unstructured meshes on a single GPU, achieves state-of-the-art accuracy on PDE surrogate benchmarks, and outperforms general-purpose efficient-attention methods on the Long Range Arena suite.
We additionally release a large-scale additive manufacturing dataset to spur further research.
Supplementary Material: zip
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Submission Number: 4780
Loading