Hierarchical Routers for Efficient Top-k Retrieval in Sparse Attention

ICLR 2026 Conference Submission 13669 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Efficient Attention, LLM, Sparse Attention
TL;DR: Our paper introduces a hierarchical routing framework with balanced bucket assignments and beam-search top-k retrieval that enables efficient sparse attention.
Abstract: Attention mechanisms have achieved remarkable success in deep learning by searching in parallel for the most relevant tokens in large-scale data. However, both the memory and computational costs of self-attention scale quadratically with sequence length, making it infeasible for long sequences. Recent sparse top-$k$ attention methods achieve performance comparable to full attention with much lower memory and computational overhead. Nevertheless, they often rely on graph- or tree-based index structures, which are too slow to rebuild across layers and heads for batches of token sequences, or on partition-based techniques, which lack precision. To address this issue, we propose a search algorithm for sparse attention, the Hierarchical Router Algorithm (HiRouter), which efficiently constructs indexing structures and dynamically retrieves top-$k$ tokens on a per-sequence basis, striking a better balance between speed and accuracy. HiRouter employs a multi-level routing mechanism that hierarchically partitions tokens into discrete buckets along a learned tree structure at $O(T)$ cost in the sequence length $T$. Notably, our dual entropy loss directly regularizes the embeddings: an affinity term encourages stronger sample–centroid alignment to improve top-$k$ recall, and a balance term keeps bucket occupancy even to ensure efficient GPU parallelism. HiRouter outperforms FlashAttention in speed on long sequences while matching or surpassing the accuracy of full attention, offering a compelling solution for scalable and efficient attention mechanisms.
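To make the multi-level routing and beam-search retrieval described in the abstract concrete, the PyTorch sketch below shows a minimal two-level router that assigns keys to buckets via learned centroids and retrieves an approximate top-$k$ candidate set per query by beam-searching the most promising buckets. This is an illustrative sketch under our own assumptions, not the authors' implementation: all names (`HiRouterSketch`, `assign_buckets`, `topk_candidates`, `beam_width`) and the exact scoring rules are hypothetical, and the dual entropy loss that trains the centroids is omitted.

```python
# Hypothetical sketch of hierarchical bucket routing with beam-search top-k
# retrieval. Not the authors' code; names and scoring choices are illustrative.
import torch


class HiRouterSketch(torch.nn.Module):
    def __init__(self, dim, n_top=8, n_leaf=8, beam_width=2):
        super().__init__()
        # Level-1 and level-2 routing centroids (learned parameters).
        self.top_centroids = torch.nn.Parameter(torch.randn(n_top, dim))
        self.leaf_centroids = torch.nn.Parameter(torch.randn(n_top, n_leaf, dim))
        self.beam_width = beam_width

    def assign_buckets(self, keys):
        # keys: (T, d) -> one flat bucket id per key; routing cost is O(T).
        top_scores = keys @ self.top_centroids.t()               # (T, n_top)
        top_idx = top_scores.argmax(dim=-1)                      # (T,)
        leaf_scores = torch.einsum(
            "td,tld->tl", keys, self.leaf_centroids[top_idx])    # (T, n_leaf)
        leaf_idx = leaf_scores.argmax(dim=-1)                    # (T,)
        n_leaf = self.leaf_centroids.shape[1]
        return top_idx * n_leaf + leaf_idx                       # flat bucket id

    def topk_candidates(self, query, keys, bucket_ids, k=16):
        # Beam search: keep the `beam_width` best level-1 routers, then the
        # `beam_width` best leaves under each, and gather keys only from
        # those buckets before exact rescoring against the query.
        top_scores = query @ self.top_centroids.t()              # (n_top,)
        beam = top_scores.topk(self.beam_width).indices          # (beam,)
        leaf_scores = torch.einsum(
            "d,bld->bl", query, self.leaf_centroids[beam])       # (beam, n_leaf)
        n_leaf = self.leaf_centroids.shape[1]
        best_leaves = leaf_scores.topk(self.beam_width, dim=-1).indices
        flat = (beam.unsqueeze(1) * n_leaf + best_leaves).flatten()
        cand = torch.cat(
            [torch.nonzero(bucket_ids == b).flatten() for b in flat])
        if cand.numel() == 0:
            return torch.empty(0, dtype=torch.long)
        scores = keys[cand] @ query                              # exact rescoring
        return cand[scores.topk(min(k, cand.numel())).indices]   # approx top-k


# Toy usage: route 1024 keys into buckets, then fetch ~16 candidates per query.
router = HiRouterSketch(dim=64)
keys = torch.randn(1024, 64)
buckets = router.assign_buckets(keys)
query = torch.randn(64)
cand_idx = router.topk_candidates(query, keys, buckets, k=16)
```

In this sketch, bucket assignment costs one matrix product per level, so it is linear in the number of tokens, and `beam_width` controls the recall/compute trade-off at retrieval time.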
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 13669