Foreign Sparse Attention: Effective Distillation into Sparse Attention

Published: 10 Jun 2025, Last Modified: 10 Jun 2025, LCFM 2025, CC BY 4.0
Keywords: sparse attention
Abstract: Transformer architectures have often been maligned for the quadratic complexity of global self-attention, yet global self-attention has proven critical for performance in many applications. Recently, reasoning models have pushed the limits of token generation, producing tens of thousands of chain-of-thought tokens for a single query. Now more than ever, efficient attention alternatives are critical. Native sparse attention is a promising recent alternative to global self-attention, but it has not been validated at the scale of frontier pretrained model releases. In this work, we present Foreign Sparse Attention: an effective and efficient distillation method for transferring global self-attention into native sparse attention. We validate that our distilled Qwen model performs competitively with the teacher, in some instances improving in accuracy on data we did not distill on while generating fewer tokens in its responses.
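As a rough illustration of the kind of attention-distillation setup described above, the sketch below shows a generic logit-matching objective between a frozen dense-attention teacher and a trainable sparse-attention student. This is a minimal assumption-laden example, not the paper's actual training recipe; the model objects, temperature, and loss form are hypothetical placeholders.

```python
# Minimal sketch of a distillation objective for transferring a dense-attention
# teacher into a sparse-attention student. Illustrative only: the models,
# temperature, and loss choice are assumptions, not the paper's method.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between softened teacher and student next-token distributions."""
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

# Usage sketch: run the same batch through the frozen dense teacher and the
# trainable sparse-attention student, then minimize the KL between their logits.
# with torch.no_grad():
#     teacher_logits = teacher(input_ids).logits
# student_logits = student(input_ids).logits
# loss = distillation_loss(student_logits, teacher_logits)
```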
Submission Number: 34