Top-Theta Attention: Sparsifying Transformers by Compensated Thresholding

16 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: LLM, Transformers, Attention
TL;DR: Pruning attention weights against a calibrated threshold processes up to 10x fewer attention elements and loads 3-10x fewer V-rows from the KV cache.
Abstract: We present Top-Theta (Top-θ) Attention, a training-free method for sparsifying transformer attention during inference. Our key insight is that static, per-head thresholds can be calibrated to retain the desired constant number of significant elements per attention row. This approach enables content-based sparsity without retraining, and it remains robust across data domains. We further introduce compensation techniques to preserve accuracy under aggressive sparsification, establishing attention thresholding as a practical and principled alternative to top-k attention. We provide extensive evaluation on natural language processing tasks, showing that Top-θ achieves a 3-10x reduction in $V$-cache usage and up to 10x fewer attention elements during inference, with no more than a 1% degradation in accuracy.
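The core idea in the abstract can be illustrated with a minimal sketch: calibrate a per-head threshold on held-out attention scores so that, on average, about k elements per row exceed it, then at inference keep only the scores above that threshold and load only the corresponding V-rows. The function names, the mean-of-kth-values calibration rule, and the argmax fallback below are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def calibrate_threshold(score_rows, k):
    """Pick a static per-head threshold from calibration rows so that,
    on average, roughly the top-k scores per row survive.
    (Hypothetical calibration rule: mean of each row's k-th largest score.)"""
    kth_vals = [np.sort(row)[-k] for row in score_rows]
    return float(np.mean(kth_vals))

def top_theta_attention(q, K, V, theta):
    """Single-query attention with threshold-based sparsification:
    only V-rows whose score passes theta are loaded and mixed."""
    scores = K @ q / np.sqrt(q.shape[0])
    keep = scores >= theta
    if not keep.any():  # fallback: always keep at least the top score
        keep[scores.argmax()] = True
    w = np.exp(scores[keep] - scores[keep].max())  # stable softmax on survivors
    w /= w.sum()
    return w @ V[keep]

rng = np.random.default_rng(0)
d, n = 16, 64
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))
# Calibration: score rows from held-out queries against the same keys
calib_rows = [K @ rng.standard_normal(d) / np.sqrt(d) for _ in range(32)]
theta = calibrate_threshold(calib_rows, k=8)
out = top_theta_attention(rng.standard_normal(d), K, V, theta)
```

Unlike top-k, the threshold comparison needs no per-row sort at inference time; the compensation techniques the paper describes would additionally correct for the probability mass dropped below theta.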
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 7914