Keywords: LLM, Transformers, Attention
TL;DR: Pruning attention weights against a calibrated threshold enables processing 10x fewer attention elements and loading 3-10x fewer V-rows from the KV cache.
Abstract: We present Top-Theta (Top-θ) Attention, a training-free method for sparsifying transformer attention during inference. Our key insight is that static, per-head thresholds can be calibrated to retain the desired constant number of significant elements per attention row. This approach enables content-based sparsity without retraining, and it remains robust across data domains. We further introduce compensation techniques to preserve accuracy under aggressive sparsification, establishing attention thresholding as a practical and principled alternative to top-k attention. We provide extensive evaluation on natural language processing tasks, showing that Top-θ achieves a 3-10x reduction in $V$-cache usage and up to 10x fewer attention elements during inference, with no more than a 1% degradation in accuracy.
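The core idea described above can be illustrated with a minimal NumPy sketch. This is a hypothetical reconstruction, not the authors' implementation: `calibrate_threshold` and `apply_threshold` are illustrative names, the calibration uses a simple global quantile over softmax rows of one head, and renormalization stands in for the compensation techniques the abstract mentions.

```python
import numpy as np

def calibrate_threshold(attn_rows: np.ndarray, k: int) -> float:
    """Choose one scalar threshold for a head so that, on the
    calibration rows, an average of k elements per row exceed it."""
    _, row_len = attn_rows.shape
    # Keep the top k/row_len fraction of elements overall.
    return float(np.quantile(attn_rows, 1.0 - k / row_len))

def apply_threshold(attn_row: np.ndarray, theta: float) -> np.ndarray:
    """Zero sub-threshold elements and renormalize the row (one simple
    way to compensate for the dropped probability mass)."""
    sparse = np.where(attn_row >= theta, attn_row, 0.0)
    total = sparse.sum()
    return sparse / total if total > 0 else attn_row

# Calibrate on synthetic softmax attention rows for one head.
rng = np.random.default_rng(0)
logits = rng.normal(size=(256, 128))
rows = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
theta = calibrate_threshold(rows, k=16)

sparse_row = apply_threshold(rows[0], theta)
print(np.count_nonzero(sparse_row))  # on average ~16 of 128 elements survive
```

Only rows of $V$ corresponding to the surviving elements would then need to be loaded from the KV cache, which is where the reported 3-10x reduction comes from.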
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 7914