Keywords: Attention, Efficient AI, Large Language Model, Inference Acceleration
TL;DR: BD Attention: the first lossless algorithmic acceleration of attention
Abstract: Attention is a core operation in large language models (LLMs) and vision-language models (VLMs). We present BD Attention (**BDA**), the first *lossless algorithmic reformulation* of attention. BDA is enabled by a simple matrix identity from Basis Decomposition (**BD**), which restructures multi-head projections into a compact form while preserving exact outputs. Unlike I/O-aware system optimizations such as FlashAttention, BDA provides a mathematically guaranteed acceleration that is architecture-agnostic. On DeepSeek-V2-Lite (16B, FP16), BDA requires only **4s** of offline preparation **with no retraining required** and, on modern GPUs, achieves **32\% faster** key/value projections and **25\% smaller** weights, while increasing end-to-end perplexity (PPL) by just **0.02\%** (FP16) or **0.0004\%** (FP32)—a negligible effect on model performance. These results position BDA as the first theoretically exact method for lossless attention acceleration that is complementary to existing engineering-level optimizations. Our code is available at https://anonymous.4open.science/r/Basis-decomp-57B8.
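The abstract states that BD restructures multi-head projections via a matrix identity that preserves exact outputs while shrinking the weights. The paper's actual identity is not reproduced here; the sketch below is a minimal stand-in under one labeled assumption: that a key/value projection weight `W` is exactly rank-deficient, so an exact factorization `W = U @ V` yields identical outputs (up to floating-point rounding) with fewer parameters and FLOPs. The dimensions, rank `r`, and the SVD-based preparation step are all illustrative choices, not the paper's method.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_proj, r = 512, 512, 128

# Stand-in for a trained projection weight that happens to have rank r.
W = rng.standard_normal((d_model, r)) @ rng.standard_normal((r, d_proj))

# Offline preparation (one-time): exact factorization via SVD,
# truncated at the true rank so no information is discarded.
U_full, S, Vt = np.linalg.svd(W, full_matrices=False)
U = U_full[:, :r] * S[:r]   # (d_model, r)
V = Vt[:r, :]               # (r, d_proj)

x = rng.standard_normal((4, d_model))
y_orig = x @ W              # original projection
y_fact = (x @ U) @ V        # factored projection: fewer FLOPs since r < d_proj

# Outputs agree to floating-point rounding; parameters drop by half here.
print(np.allclose(y_orig, y_fact))
print(W.size, U.size + V.size)
```

In this toy setting the factored form stores 131,072 parameters instead of 262,144 and replaces one 512×512 matmul with two thinner ones, mirroring the kind of exact-output, reduced-cost trade the abstract claims; the residual difference between `y_orig` and `y_fact` comes only from floating-point reordering, consistent with the abstract's near-zero PPL change at FP32.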
Primary Area: foundation or frontier models, including LLMs
Submission Number: 10387