Long-Context Modeling with Dynamic Hierarchical Sparse Attention for On-Device LLMs

ICLR 2026 Conference Submission 24868 Authors

20 Sept 2025 (modified: 02 Dec 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Long-context language models, Sparse attention, Dynamic sparsity, On-device inference, Efficient LLMs
Abstract: The quadratic cost of attention hinders the scalability of long-context LLMs, particularly in resource-constrained settings. While attention is known to be often sparse, existing static sparse methods such as sliding windows or global tokens cannot adapt to task- or input-dependent variations in attention. Recently proposed dynamic sparse-attention approaches still depend on predefined templates or heuristic mechanisms that reduce generality and may prune tokens that remain contextually important. We therefore introduce Dynamic Hierarchical Sparse Attention (DHSA), a data-driven framework that predicts attention sparsity online without retraining the base LLM. DHSA adaptively segments sequences into variable-length chunks, then computes chunk representations by aggregating the token embeddings within each chunk. To avoid the bias introduced by varying chunk lengths, we apply a length-normalized aggregation that scales the averaged embeddings by the square root of the chunk size. Finally, DHSA upsamples the chunk-level similarity scores to the token level to produce importance scores that determine which token-level interactions are preserved. Our experiments with Needle-in-a-Haystack and LongBench show that DHSA preserves near-dense accuracy even in highly sparse regimes, yielding 12–20% relative accuracy gains over Block Sparse Attention at comparable prefill cost. Using a FlashAttention-2 (FA2)–based kernel, DHSA achieves up to a $10\times$ prefill speedup over dense FA2 at 128K context length. On Llama-3.1-8B (4-bit), DHSA scales to 100K context with strong accuracy and competitive latency on a single 24 GB GPU, where dense kernels fail between 16K and 64K. The implementation supports both GPU and CPU backends and is compatible with multiple open-weight model families. These results highlight DHSA as an efficient and adaptable solution for long-context inference with on-device LLMs.
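The following is a minimal sketch of the chunk-level scoring step described in the abstract, assuming chunk boundaries are already given; the function names, the `keep_ratio` threshold, and the top-k selection are illustrative assumptions, and the paper's boundary prediction, per-head handling, and FA2 kernel integration are not reproduced here.

```python
# Sketch (not the authors' implementation): length-normalized chunk aggregation
# and upsampling of chunk-level similarity scores to a token-level keep mask.
import torch

def chunk_representations(x: torch.Tensor, boundaries: list) -> torch.Tensor:
    """Aggregate token embeddings per chunk with length-normalized averaging.

    x:          (seq_len, d) token embeddings
    boundaries: chunk end indices, e.g. [4, 9, 16] for three variable-length chunks
    Returns     (num_chunks, d); each row is mean(chunk) * sqrt(chunk_size).
    """
    reps, start = [], 0
    for end in boundaries:
        chunk = x[start:end]                                   # (chunk_len, d)
        reps.append(chunk.mean(dim=0) * chunk.shape[0] ** 0.5)  # length-normalized aggregation
        start = end
    return torch.stack(reps)

def token_importance_mask(x: torch.Tensor, boundaries: list, keep_ratio: float = 0.25) -> torch.Tensor:
    """Upsample chunk-level similarity scores to a token-level keep mask."""
    reps = chunk_representations(x, boundaries)                # (C, d)
    sim = reps @ reps.T                                        # (C, C) chunk-pair scores
    # Broadcast each chunk-pair score onto its block of token pairs.
    sizes = torch.tensor([boundaries[0]] + [b - a for a, b in zip(boundaries, boundaries[1:])])
    token_scores = sim.repeat_interleave(sizes, dim=0).repeat_interleave(sizes, dim=1)
    # Keep the top-scoring fraction of token-token interactions (assumed selection rule).
    k = max(1, int(keep_ratio * token_scores.numel()))
    thresh = token_scores.flatten().topk(k).values.min()
    return token_scores >= thresh                              # (seq_len, seq_len) boolean mask

if __name__ == "__main__":
    x = torch.randn(16, 64)
    mask = token_importance_mask(x, boundaries=[4, 9, 16])
    print(mask.shape, mask.float().mean().item())  # ~keep_ratio fraction of interactions kept
```

Scaling the averaged embedding by the square root of the chunk size keeps inner products between chunks of different lengths on a comparable scale, which is the bias the abstract's length normalization is meant to remove.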
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 24868