Keywords: attention, kv cache, mla, deepseek, sparse attention
Abstract: DeepSeek Sparse Attention (DSA) introduces a lightning indexer together with a fine-grained sparse multi-head latent attention (Sparse MLA) mechanism. The indexer efficiently computes relevance scores for each query token and retrieves only the key–value pairs corresponding to the top-k scores. Compared to prior chunk-based sparse attention and sliding-window attention methods, DSA provides greater flexibility and modeling capacity.
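The core retrieval step described above — score every key–value pair per query, then keep only the top-k — can be sketched as follows. This is a minimal illustration using NumPy, assuming the indexer's relevance scores are already computed; the function name `topk_select` and the score shapes are hypothetical, not DSA's actual implementation.

```python
import numpy as np

def topk_select(index_scores: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k highest-scoring key-value pairs per query.

    index_scores: (num_queries, num_keys) relevance scores produced by
    the indexer. (Hypothetical sketch: the real lightning indexer computes
    these scores with small learned projections.)
    """
    # argpartition finds the top-k set in O(num_keys) without a full sort
    part = np.argpartition(-index_scores, k - 1, axis=-1)[:, :k]
    # sort the selected k indices by descending score for a stable order
    rows = np.arange(index_scores.shape[0])[:, None]
    order = np.argsort(-index_scores[rows, part], axis=-1)
    return np.take_along_axis(part, order, axis=-1)
```

Attention is then computed only over the selected indices, which is what gives DSA its per-token (rather than per-chunk or per-window) granularity.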
Despite its effectiveness, DSA has several limitations. Because the lightning indexer is trained by distilling from the main branch as a teacher, DSA only supports continued pre-training and cannot be trained from scratch. Moreover, training DSA is computationally expensive: it requires an explicit dense warm-up stage to align the indexer with dense MLA, and during subsequent sparse training the main model is optimized with a cross-entropy loss to adapt from dense to sparse MLA while a KL-divergence loss is simultaneously applied to keep the indexer aligned with sparse MLA.
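The indexer-alignment objective mentioned above — matching the indexer's score distribution to the main branch's attention distribution via a KL-divergence loss — can be sketched as follows. This is an illustrative NumPy version under assumed shapes; the function names and the exact form of the teacher distribution are hypothetical, not DeepSeek's implementation.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def indexer_kl_loss(attn_scores: np.ndarray, indexer_scores: np.ndarray) -> float:
    """Mean KL(p_attn || p_indexer) over queries.

    attn_scores:    (Q, K) main-branch attention logits (teacher; treated
                    as fixed, i.e. no gradient flows back through it).
    indexer_scores: (Q, K) lightning-indexer logits (student).
    Hypothetical sketch of the alignment objective, not the exact DSA loss.
    """
    p = softmax(attn_scores)      # teacher distribution
    q = softmax(indexer_scores)   # student distribution
    eps = 1e-9                    # avoid log(0)
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))
```

The loss is zero when the two distributions match and positive otherwise, so minimizing it pulls the indexer's token ranking toward the main model's attention pattern.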
To address these limitations, we propose NSMLA, which employs a native indexer during pre-training. The native indexer introduces no additional parameters, yet enables efficient top-k token selection and more effective optimization of sparse MLA. In a subsequent annealing stage, the native indexer is converted into a memory-efficient lightning indexer, allowing the model to adapt ahead of deployment to the faster inference-time indexer.
The native indexer is functionally equivalent to the teacher module in DSA, and therefore more faithfully captures the main model’s token preferences. As a result, NSMLA eliminates the need for an expensive dense warm-up stage, requires no KL-divergence loss, and avoids gradient updates to the indexer. The lightning indexer shares a similar architecture with the student module in DSA. Although the conversion introduces a small temporary performance drop, this loss is recovered during the annealing stage.
Experiments on DeepSeek-V2-Lite trained on the Open-Thought dataset show that NSMLA consistently outperforms DSA, while substantially reducing both the number of training steps and the per-step computational cost required for attention-score alignment.
Paper Type: Long
Research Area: LLM Efficiency
Research Area Keywords: Efficient/Low-Resource Methods for NLP, Language Modeling
Contribution Types: Approaches to low-resource settings, Approaches low compute settings-efficiency
Languages Studied: English, Chinese
Submission Number: 78