TyphoonMLA: A Mixed Naive-Absorb MLA Kernel For Shared Prefix

Published: 26 Jan 2026, Last Modified: 12 Feb 2026 · ICLR 2026 Poster · CC BY 4.0
Keywords: Large language models, inference, multi-head latent attention, shared prefix
TL;DR: A novel MLA kernel that integrates the naive and absorb implementations to fully exploit the data reuse that shared prefixes expose in attention calculations
Abstract: Multi-Head Latent Attention (MLA) is a recent attention mechanism adopted in state-of-the-art LLMs such as DeepSeek-v3 and Kimi K2. Thanks to its novel formulation, MLA allows two functionally equivalent but computationally distinct kernel implementations: naive and absorb. While naive kernels (e.g., FlashAttention) are typically preferred in training and prefill for their computational efficiency, existing decoding kernels (e.g., FlashMLA) rely on the absorb method to minimize HBM bandwidth usage. However, the compute-bound nature of the absorb implementations prevents them from benefiting from data reuse opportunities in attention calculations, such as shared prefixes. In this work, we introduce TyphoonMLA, a hybrid approach that combines the naive and absorb formulations to harness the strengths of both. TyphoonMLA effectively leverages the shared prefix by applying the naive formulation to the compute-bound parts of the attention calculation, while reducing the bandwidth requirements of the non-shared parts by using the absorb formulation. As a result, TyphoonMLA improves the throughput of attention calculations in MLA architectures by up to 3× and 3.24× on NPUs and GPUs, respectively, and boosts end-to-end throughput by up to 1.48× in tokens per second, with only a 3% overhead in HBM size.
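To make the hybrid idea concrete, below is a minimal PyTorch sketch of the split described in the abstract; it is not the authors' kernel. The shared prefix is handled in the naive form (latents decompressed to per-head K/V once and reused by every request in the batch), the per-request suffix in the absorb form (attention computed directly over the compressed latents), and the two partial results are merged with a log-sum-exp correction. All shapes and parameter names (W_uk, W_uv, c_shared, c_suffix) are illustrative assumptions, and the decoupled RoPE key path of real MLA is omitted for brevity.

```python
# Hybrid naive/absorb MLA sketch (illustrative only, simplified MLA without RoPE).
import torch

torch.manual_seed(0)
B, H, D, DC = 4, 8, 64, 256          # batch, heads, head dim, latent (compressed) dim
S_SHARED, S_SUFFIX = 128, 16         # shared-prefix length, per-request suffix length

# Hypothetical MLA up-projection matrices from the latent KV cache.
W_uk = torch.randn(H, DC, D) / DC**0.5    # latent -> per-head keys
W_uv = torch.randn(H, DC, D) / DC**0.5    # latent -> per-head values

# Latent KV caches: one shared prefix for the whole batch, one suffix per request.
c_shared = torch.randn(S_SHARED, DC)       # [S_shared, DC]
c_suffix = torch.randn(B, S_SUFFIX, DC)    # [B, S_suffix, DC]
q = torch.randn(B, H, D)                   # decode-time queries, one token per request

scale = D ** -0.5

# --- Naive branch over the shared prefix -------------------------------------
# Decompress the shared latents to per-head K/V once; every request reuses them.
k_shared = torch.einsum("sc,hcd->hsd", c_shared, W_uk)       # [H, S_shared, D]
v_shared = torch.einsum("sc,hcd->hsd", c_shared, W_uv)       # [H, S_shared, D]
logits_n = torch.einsum("bhd,hsd->bhs", q, k_shared) * scale

# --- Absorb branch over the non-shared suffix ---------------------------------
# Fold W_uk into the query so attention runs directly on the compressed latents.
q_lat = torch.einsum("bhd,hcd->bhc", q, W_uk)                # [B, H, DC]
logits_a = torch.einsum("bhc,btc->bht", q_lat, c_suffix) * scale

# --- Merge the two partial attentions (log-sum-exp combination) --------------
m = torch.maximum(logits_n.max(-1).values, logits_a.max(-1).values)  # [B, H]
w_n = torch.exp(logits_n - m[..., None])
w_a = torch.exp(logits_a - m[..., None])
denom = w_n.sum(-1) + w_a.sum(-1)                                     # [B, H]

out_n = torch.einsum("bhs,hsd->bhd", w_n, v_shared)                   # naive: weighted V
out_a_lat = torch.einsum("bht,btc->bhc", w_a, c_suffix)               # absorb: weighted latents
out_a = torch.einsum("bhc,hcd->bhd", out_a_lat, W_uv)                 # decompress at the end

out = (out_n + out_a) / denom[..., None]                              # [B, H, D]

# Sanity check against fully naive attention over the concatenated prefix + suffix.
k_full = torch.cat([k_shared.expand(B, H, S_SHARED, D),
                    torch.einsum("btc,hcd->bhtd", c_suffix, W_uk)], dim=2)
v_full = torch.cat([v_shared.expand(B, H, S_SHARED, D),
                    torch.einsum("btc,hcd->bhtd", c_suffix, W_uv)], dim=2)
ref = torch.softmax(torch.einsum("bhd,bhtd->bht", q, k_full) * scale, dim=-1)
ref = torch.einsum("bht,bhtd->bhd", ref, v_full)
print(torch.allclose(out, ref, atol=1e-4))  # True: both paths compute the same attention
```

The point of the split in this sketch is that the shared-prefix K/V are decompressed only once and amortized over the whole batch (where the naive form's compute cost pays off through reuse), while the per-request latents stay compressed (where the absorb form keeps HBM traffic low), matching the abstract's description of the hybrid.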
Supplementary Material: zip
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 5838