Keywords: Efficient On-Device Inference, Large Language Models, Chain-of-Thought Reasoning
Abstract: The deployment of reasoning-intensive LLMs on edge devices is severely hampered by the memory bandwidth bottlenecks inherent in processing long Chain-of-Thought sequences. To address this, we propose State-Aware Dynamic Attention Scheduling (SADAS), a novel framework that aligns computational complexity with cognitive phases. SADAS leverages control tokens to autonomously toggle between bandwidth-efficient Sliding Window Attention for intermediate reasoning steps and high-fidelity Full Attention for final answer integration. This mechanism reduces memory pressure without sacrificing global context. Experimental results on commodity CPUs demonstrate that SADAS transforms unresponsive long-chain reasoning into a real-time capability, achieving up to 3.88$\times$ end-to-end speedups. Crucially, SADAS-1.7B matches or exceeds the performance of leading 8B-scale reasoning models on challenging reasoning benchmarks such as AIME24, while maintaining robust agentic instruction-following capabilities. To the best of our knowledge, SADAS is the first architecture to implement token-level dynamic attention mixing, establishing a new paradigm for responsive on-device intelligence.
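The control-token toggling described in the abstract can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the paper's implementation: `THINK_END` is a hypothetical control-token id, and the window size is arbitrary; the sketch only shows how a per-token causal mask could mix sliding-window attention (before the control token) with full attention (from it onward).

```python
# Minimal sketch of token-level dynamic attention mixing, in the spirit of
# SADAS. THINK_END is a hypothetical control-token id marking the switch
# from intermediate reasoning to final answer integration; the paper's
# actual control tokens and window size are not specified here.
THINK_END = 2

def build_mask(token_ids, window=4):
    """Build a causal attention mask, one row per query position.

    Queries before the control token attend only to the last `window`
    positions (bandwidth-efficient sliding window); queries at or after
    it attend to the entire prefix (full attention).
    """
    n = len(token_ids)
    # Position where the phase switches (end of sequence if no control token).
    switch = token_ids.index(THINK_END) if THINK_END in token_ids else n
    mask = []
    for q in range(n):
        lo = 0 if q >= switch else max(0, q - window + 1)
        # Causal: query q may attend to keys lo..q inclusive.
        mask.append([lo <= k <= q for k in range(n)])
    return mask
```

Under these assumptions, reasoning-phase rows of the mask are narrow bands (bounded KV-cache reads per step), while answer-phase rows span the full prefix, matching the abstract's claim of reduced memory pressure without losing global context at integration time.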
Paper Type: Long
Research Area: LLM Efficiency
Research Area Keywords: LLM Efficiency;efficient models;model architectures;reasoning
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings
Languages Studied: English, Chinese
Submission Number: 4831