Long-Context Modeling with Dynamic Hierarchical Sparse Attention for On-Device LLMs

ICLR 2026 Conference Submission 24868 Authors

20 Sept 2025 (modified: 02 Dec 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Long-context language models, Sparse attention, Dynamic sparsity, On-device inference, Efficient LLMs
Abstract: The quadratic cost of attention hinders the scalability of long-context LLMs, particularly in resource-constrained settings. While attention is known to be often sparse, existing static sparse methods such as sliding windows or global tokens cannot adapt to task- or input-dependent variations in attention. Recently proposed dynamic sparse-attention approaches still depend on predefined templates or heuristic mechanisms that reduce generality and may prune tokens that remain contextually important. We therefore introduce Dynamic Hierarchical Sparse Attention (DHSA), a data-driven framework that predicts attention sparsity online without retraining the base LLM. DHSA adaptively segments sequences into variable-length chunks, then computes chunk representations by aggregating the token embeddings within each chunk. To avoid the bias introduced by varying chunk lengths, we apply a length-normalized aggregation that scales the averaged embeddings by the square root of the chunk size. Finally, DHSA upsamples the chunk-level similarity scores to the token level to produce importance scores that determine which token-level interactions are preserved. Our experiments with Needle-in-a-Haystack and LongBench show that DHSA preserves near-dense accuracy even in highly sparse regimes, yielding 12–20% relative accuracy gains over Block Sparse Attention at comparable prefill cost. Using a FlashAttention-2 (FA2)–based kernel, DHSA achieves up to a $10\times$ prefill speedup over dense FA2 at 128K context length. On Llama-3.1-8B (4-bit), DHSA scales to 100K context with strong accuracy and competitive latency on a single 24 GB GPU, where dense kernels fail between 16K and 64K. The implementation supports both GPU and CPU backends and is compatible with multiple open-weight model families. These results highlight DHSA as an efficient and adaptable solution for long-context inference with on-device LLMs.
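The following is a minimal sketch of the chunk-level scoring step described in the abstract, assuming chunk boundaries are already given; the function names, the `keep_ratio` threshold, and the top-k selection are illustrative assumptions, and the paper's boundary prediction, per-head handling, and FA2 kernel integration are not reproduced here.

```python
# Sketch (not the authors' implementation): length-normalized chunk aggregation
# and upsampling of chunk-level similarity scores to a token-level keep mask.
import torch

def chunk_representations(x: torch.Tensor, boundaries: list) -> torch.Tensor:
    """Aggregate token embeddings per chunk with length-normalized averaging.

    x:          (seq_len, d) token embeddings
    boundaries: chunk end indices, e.g. [4, 9, 16] for three variable-length chunks
    Returns     (num_chunks, d); each row is mean(chunk) * sqrt(chunk_size).
    """
    reps, start = [], 0
    for end in boundaries:
        chunk = x[start:end]                                   # (chunk_len, d)
        reps.append(chunk.mean(dim=0) * chunk.shape[0] ** 0.5)  # length-normalized aggregation
        start = end
    return torch.stack(reps)

def token_importance_mask(x: torch.Tensor, boundaries: list, keep_ratio: float = 0.25) -> torch.Tensor:
    """Upsample chunk-level similarity scores to a token-level keep mask."""
    reps = chunk_representations(x, boundaries)                # (C, d)
    sim = reps @ reps.T                                        # (C, C) chunk-pair scores
    # Broadcast each chunk-pair score onto its block of token pairs.
    sizes = torch.tensor([boundaries[0]] + [b - a for a, b in zip(boundaries, boundaries[1:])])
    token_scores = sim.repeat_interleave(sizes, dim=0).repeat_interleave(sizes, dim=1)
    # Keep the top-scoring fraction of token-token interactions (assumed selection rule).
    k = max(1, int(keep_ratio * token_scores.numel()))
    thresh = token_scores.flatten().topk(k).values.min()
    return token_scores >= thresh                              # (seq_len, seq_len) boolean mask

if __name__ == "__main__":
    x = torch.randn(16, 64)
    mask = token_importance_mask(x, boundaries=[4, 9, 16])
    print(mask.shape, mask.float().mean().item())  # ~keep_ratio fraction of interactions kept
```

Scaling the averaged embedding by the square root of the chunk size keeps inner products between chunks of different lengths on a comparable scale, which is the bias the abstract's length normalization is meant to remove.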
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 24868