Keywords: Inference, Engineering for large LMs, Compute efficient LM
TL;DR: Incremental decoding slows attention; we propose new hardware-efficient variants that reorganize attention to preserve parallelization and model quality while boosting speed, GPU utilization, and throughput, all with a minimal KV cache.
Abstract: The combination of excessive data movement, an expanding key-value cache, and the limited parallelism inherent in incremental decoding severely bottlenecks attention. We explore the design of hardware-efficient attention optimized for LLM decoding. We examine how arithmetic intensity, parallelization, and model quality interact, and assess whether current architectures fully capitalize on modern hardware. To maximize hardware efficiency, we first propose Group Tied Attention (GTA), a simple attention variant that combines and reuses key and value states to reduce memory transfers during incremental decoding while preserving model quality. We then introduce Group Latent Attention (GLA), a parallel-friendly latent attention paired with low-level optimizations, designed for fast decoding while maintaining high model quality. We empirically demonstrate the efficacy of these inference-aware variants in language modeling experiments, showing that GTA matches grouped-query attention (GQA) quality with a roughly 2x smaller KV cache, and GLA matches multi-head latent attention (MLA) but is easier to shard. Our optimized attention kernel for GLA is up to 2x faster than FlashMLA.
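The cache-size claim in the abstract can be illustrated with a minimal sketch. This is not the paper's formulation of GTA; it only assumes, for illustration, that "tying" key and value states means a single cached tensor per KV group is reused as both keys and values during a decode step, so the cached bytes (and the per-token memory traffic) are roughly half those of a GQA cache that stores keys and values separately. All function names and shapes below are hypothetical.

```python
# Minimal PyTorch sketch: cache accounting for a GQA-style decode step versus a
# tied-KV decode step. Illustrative only; not the paper's GTA implementation.
import torch
import torch.nn.functional as F


def decode_step_gqa(q, k_cache, v_cache, k_new, v_new):
    """One incremental-decoding step with separate K and V caches (GQA-style).

    q:                (batch, n_q_heads, 1, head_dim)  query for the new token
    k_cache, v_cache: (batch, n_kv_heads, t, head_dim)  cached keys / values
    k_new, v_new:     (batch, n_kv_heads, 1, head_dim)  states for the new token
    """
    k_cache = torch.cat([k_cache, k_new], dim=2)
    v_cache = torch.cat([v_cache, v_new], dim=2)
    group = q.shape[1] // k_cache.shape[1]          # query heads per KV head
    k = k_cache.repeat_interleave(group, dim=1)     # broadcast KV heads to query heads
    v = v_cache.repeat_interleave(group, dim=1)
    out = F.scaled_dot_product_attention(q, k, v)
    return out, (k_cache, v_cache)                  # 2 * n_kv_heads * t * head_dim cached


def decode_step_tied(q, kv_cache, kv_new):
    """Same step with one tied tensor per KV group reused as both K and V,
    so the cache (and the HBM traffic per decoded token) is roughly halved."""
    kv_cache = torch.cat([kv_cache, kv_new], dim=2)
    group = q.shape[1] // kv_cache.shape[1]
    kv = kv_cache.repeat_interleave(group, dim=1)
    out = F.scaled_dot_product_attention(q, kv, kv)
    return out, kv_cache                            # n_kv_heads * t * head_dim cached
```

Under these assumptions, the GQA-style cache holds `2 * n_kv_heads * t * head_dim` elements per layer, while the tied cache holds `n_kv_heads * t * head_dim`, which is the source of the "roughly 2x smaller KV cache" comparison in the abstract; the actual GTA and GLA constructions and their quality results are detailed in the paper itself.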
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Submission Number: 1175