Keywords: Fast and Accurate Reasoning, Training-Free Sparse Attention, Efficient Inference
Abstract: Large reasoning models achieve strong performance through test-time scaling. However, this comes at the cost of substantial computational overhead, particularly from excessive token generation on short input prompts. While sparse attention mechanisms can reduce latency and memory usage, existing approaches suffer from accuracy degradation caused by errors that accumulate over long reasoning generations. These methods generally require either high token retention or costly retraining. We introduce LessIsMore, a {\em training-free} sparse attention mechanism for reasoning tasks. Unlike existing approaches that rely on head-specific local optimizations, LessIsMore leverages {\em global} attention patterns by aggregating token selections across heads and combining them with recent contextual information. This unified cross-head ranking enables more efficient token selection for future decoding layers, eliminating the need to maintain separate token subsets per head and improving both generalization and efficiency. Evaluation across diverse reasoning tasks and benchmarks shows that LessIsMore preserves---and in some cases improves---accuracy while achieving up to $1.6\times$ end-to-end decoding speedup compared to full attention. Moreover, LessIsMore attends to $2\times$ fewer tokens without accuracy loss and accelerates sparse attention computation by up to $1.72\times$ compared to existing methods.
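The core idea in the abstract---one token ranking shared by all heads, built by aggregating per-head attention and always retaining recent tokens---can be illustrated with a minimal sketch. This is a hypothetical illustration under stated assumptions, not the paper's implementation; the function name, the use of a plain sum for cross-head aggregation, and the fixed recent-token window are all assumptions.

```python
import numpy as np

def unified_topk_selection(attn_scores, k, recent_window):
    """Select one shared set of k token indices for all attention heads.

    attn_scores : (num_heads, seq_len) per-head attention weights for the
                  current decoding step (hypothetical input shape).
    Returns sorted indices: the last `recent_window` tokens are always kept;
    the remaining budget is filled by the tokens with the highest
    head-aggregated attention mass (cross-head sum, an assumed choice).
    """
    num_heads, seq_len = attn_scores.shape
    recent = np.arange(max(0, seq_len - recent_window), seq_len)
    # Aggregate across heads into a single global ranking.
    global_scores = attn_scores.sum(axis=0)
    # Exclude always-kept recent tokens from the ranked pool.
    global_scores[recent] = -np.inf
    budget = max(0, k - len(recent))
    if budget > 0:
        top = np.argpartition(-global_scores, budget)[:budget]
    else:
        top = np.array([], dtype=int)
    return np.sort(np.concatenate([top, recent]).astype(int))
```

A unified selection like this lets every head gather the same cached key/value entries, which is what removes the per-head bookkeeping the abstract mentions.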
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 4877