Perfect Recall, Parallel Efficiency: Interleaved DeepSeek Sparse Attention for Million-Token-Context Decoding
Keywords: DeepSeek, Sparse Attention, Long Context
TL;DR: We propose a probabilistic approach to dynamic sparse attention, designing a high-recall and parallel-friendly DSA structure to accelerate the distributed decoding of LLMs for long sequences.
Abstract: Token-level dynamic sparse attention, exemplified by DeepSeek Sparse Attention (DSA), selects the globally most relevant key-value tokens via an exact Top-$K$ operator and achieves better model quality than block-level alternatives. However, this exact selection creates a severe bottleneck for distributed inference: enforcing an exact global Top-$K$ across GPUs incurs either redundant full-context retrieval or costly multi-stage cross-device synchronization, which largely negates the computational advantages of DSA at long context lengths. Motivated by the mathematical properties of the softmax function, we hypothesize that including additional, marginally relevant context has negligible impact on the attention output. Building on this hypothesis, we propose Interleaved DeepSeek Sparse Attention (IDSA), which distributes tokens across GPUs in an interleaved layout so that each device performs only a relaxed local Top-$m$ selection. Under this layout, the union of the independent per-GPU Top-$m$ selections covers nearly all of the globally most relevant Top-$K$ tokens. Each device can therefore proceed with its local selection at minimal cross-GPU overhead, avoiding both the expensive full-context Top-$K$ computation and multi-stage cross-GPU merging, which yields an inference pipeline that is both distributed and synchronization-efficient. Without any retraining, IDSA delivers substantial throughput gains on DeepSeek-V3.2 for context lengths exceeding 100K tokens, while preserving equivalent or better reasoning performance on the AIME benchmarks.
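The coverage claim in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `local_topm_recall`, the round-robin rule "token i lives on device i mod G", and the contiguous baseline are all assumptions introduced here for illustration. The sketch measures what fraction of the global Top-$K$ tokens is recovered by the union of per-device Top-$m$ selections under each layout.

```python
import numpy as np


def local_topm_recall(scores, num_devices, k, m, layout="interleaved"):
    """Fraction of the global Top-k tokens covered by per-device Top-m picks.

    Hypothetical helper for illustration only; the layout rules below are
    assumptions, not taken from the paper.
    """
    scores = np.asarray(scores, dtype=float)
    n = scores.shape[0]
    all_idx = np.arange(n)

    # Reference answer: exact global Top-k over the full context.
    global_topk = set(np.argsort(-scores)[:k].tolist())

    if layout == "interleaved":
        # Assumed round-robin layout: token i is stored on device i % G.
        shards = [all_idx[d::num_devices] for d in range(num_devices)]
    elif layout == "contiguous":
        # Baseline: each device holds one consecutive chunk of the context.
        shards = np.array_split(all_idx, num_devices)
    else:
        raise ValueError(f"unknown layout: {layout}")

    selected = set()
    for shard in shards:
        take = min(m, shard.size)
        # Each device ranks only its own tokens; no cross-device exchange.
        best = shard[np.argsort(-scores[shard])[:take]]
        selected.update(best.tolist())

    return len(global_topk & selected) / k


if __name__ == "__main__":
    # Toy scores that grow with position, so the relevant tokens cluster
    # at the end of the context.
    toy = np.arange(16, dtype=float)
    print(local_topm_recall(toy, num_devices=4, k=4, m=1, layout="interleaved"))
    print(local_topm_recall(toy, num_devices=4, k=4, m=1, layout="contiguous"))
```

In this toy case the relevant tokens are clustered in one region. The interleaved layout spreads them evenly across devices, so a small per-device budget $m$ is enough, whereas the contiguous layout concentrates them on a single device and loses most of them. Setting $m = K$ always gives full recall when scores are distinct, at the cost of a larger selected set.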
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 53