Keywords: Efficient On-Device Inference, Large Language Models, Chain-of-Thought Reasoning
Abstract: The deployment of reasoning-intensive LLMs on edge devices is severely hampered by the memory bandwidth bottlenecks inherent in processing long Chain-of-Thought sequences. To address this, we propose State-Aware Dynamic Attention Scheduling (SADAS), a novel framework that aligns computational complexity with cognitive phases. SADAS leverages control tokens to autonomously toggle between bandwidth-efficient Sliding Window Attention for intermediate reasoning steps and high-fidelity Full Attention for final answer integration. This mechanism reduces memory pressure without sacrificing global context. Experimental results on commodity CPUs demonstrate that SADAS transforms unresponsive long-chain reasoning into a real-time capability, achieving up to 3.88$\times$ end-to-end speedups. Crucially, SADAS-1.7B matches or exceeds the performance of leading 8B-scale reasoning models on challenging reasoning benchmarks such as AIME24, while maintaining robust agentic instruction-following capabilities. To the best of our knowledge, SADAS is the first architecture to implement token-level dynamic attention mixing, establishing a new paradigm for responsive on-device intelligence.
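The control-token toggling described in the abstract can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the paper's implementation: `THINK_END` is a hypothetical control-token id, and the window size is arbitrary; the sketch only shows how a per-token causal mask could mix sliding-window attention (before the control token) with full attention (from it onward).

```python
# Minimal sketch of token-level dynamic attention mixing, in the spirit of
# SADAS. THINK_END is a hypothetical control-token id marking the switch
# from intermediate reasoning to final answer integration; the paper's
# actual control tokens and window size are not specified here.
THINK_END = 2

def build_mask(token_ids, window=4):
    """Build a causal attention mask, one row per query position.

    Queries before the control token attend only to the last `window`
    positions (bandwidth-efficient sliding window); queries at or after
    it attend to the entire prefix (full attention).
    """
    n = len(token_ids)
    # Position where the phase switches (end of sequence if no control token).
    switch = token_ids.index(THINK_END) if THINK_END in token_ids else n
    mask = []
    for q in range(n):
        lo = 0 if q >= switch else max(0, q - window + 1)
        # Causal: query q may attend to keys lo..q inclusive.
        mask.append([lo <= k <= q for k in range(n)])
    return mask
```

Under these assumptions, reasoning-phase rows of the mask are narrow bands (bounded KV-cache reads per step), while answer-phase rows span the full prefix, matching the abstract's claim of reduced memory pressure without losing global context at integration time.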
Paper Type: Long
Research Area: LLM Efficiency
Research Area Keywords: LLM Efficiency;efficient models;model architectures;reasoning
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings
Languages Studied: English, Chinese
Submission Number: 4831