Sliding Window Attention for Reinforced Reasoning

ICLR 2026 Conference Submission9061 Authors

17 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: RNN, Recurrent LLM, Reasoning, RL
TL;DR: We are the first to show the potential of sliding-window attention in reinforced reasoning, achieving better performance, longer context, and higher throughput than self-attention.
Abstract: Large reasoning models such as DeepSeek-R1 employ reinforcement learning (RL) to incentivize reasoning capability. As the context length grows, the quadratic complexity of self-attention (SA) prohibits scaling to longer contexts. Recently, hybrid, sparse, and linear attention methods have aimed to reduce the cost of SA, yet they suffer from costly retraining, high complexity, or linear memory growth. To address this, we revisit sliding-window attention (SWA). It not only offers linear-time complexity and constant memory, enabling faster RL rollouts, but also allows cheap conversion from pretrained transformers. Notably, we prove that SWA can handle reasoning tasks well due to the locality of thought. In this paper, we introduce Sliding Window Attention for Reinforced Reasoning (SWARR), a two-stage approach: (1) math-specific supervised fine-tuning that converts a pretrained SA model into an SWA model as a cold start, and (2) RL optimization with DAPO to enhance reasoning capability. Under the same settings, SWARR outperforms SA by 1.78% at the 1.5B scale, while delivering 6.2x higher throughput, 8x larger batch size, and 1.5x longer context under the same memory budget. SWARR achieves competitive performance among 1.5B and 7B models, surpassing DeepSeek-R1-Distill-Qwen-1.5B and 7B by 1.9% and 3.4%, respectively. To our knowledge, this is the first work to show that trained SWA is a competitive alternative to transformers, enabling efficient and scalable reasoning.
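To make the attention pattern concrete, below is a minimal PyTorch sketch of the sliding-window masking the abstract refers to: each token attends only to itself and the preceding `window - 1` tokens. This is an illustrative mask-based implementation, not the authors' code; the function name, single-head setup, and toy shapes are assumptions, and a production SWA kernel would use a banded computation with a rolling KV cache to realize the claimed linear time and constant memory rather than materializing the full score matrix.

```python
# Minimal sketch of sliding-window attention (SWA); illustrative only, not the SWARR implementation.
import torch
import torch.nn.functional as F

def sliding_window_attention(q, k, v, window: int):
    """Causal attention where token i attends only to tokens j with
    i - window < j <= i, i.e. the last `window` positions including itself."""
    T, d = q.shape[-2], q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d**0.5            # (..., T, T) raw attention scores
    idx = torch.arange(T)
    causal = idx[None, :] <= idx[:, None]                 # j <= i
    in_window = (idx[:, None] - idx[None, :]) < window    # i - j < window
    mask = causal & in_window
    scores = scores.masked_fill(~mask, float("-inf"))     # disallow positions outside the window
    return F.softmax(scores, dim=-1) @ v

# Toy usage: 8 tokens, head dimension 4, window of 3 (hypothetical sizes).
q = k = v = torch.randn(8, 4)
out = sliding_window_attention(q, k, v, window=3)
print(out.shape)  # torch.Size([8, 4])
```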
Primary Area: foundation or frontier models, including LLMs
Submission Number: 9061