Keywords: Reasoning Model, Model Compression, Efficiency
TL;DR: We introduce TSAR, a training-free framework that accelerates LLM reasoning by adaptively pruning unnecessary thinking steps and reducing computational precision, achieving up to 12.4× speedup without sacrificing accuracy.
Abstract: Chain-of-Thought (CoT) reasoning in Large Language Models (LLMs) achieves remarkable performance but suffers from significant computational overhead. CoT reasoning exhibits redundancy across two critical dimensions: temporal redundancy, where reasoning steps may be unnecessary, and spatial redundancy, where computations can be performed at reduced precision. While existing approaches require expensive dataset construction and model fine-tuning to improve reasoning efficiency, we propose Temporal-Spatial Adaptive Reasoning (TSAR), a training-free framework that jointly exploits both redundancy dimensions through coordinated optimization. TSAR segments reasoning based on Dewey's reflective thinking model, employs progressive precision reduction that adapts to both the reasoning phase and the reasoning progress, and coordinates termination decisions through entropy-based confidence estimation. Our adaptive scheduler prevents precision-induced errors while enabling compound efficiency gains. Extensive evaluation on diverse reasoning tasks demonstrates up to 12.4× speedup while maintaining accuracy, establishing coordinated multi-dimensional redundancy exploitation as superior to conventional optimization strategies.
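The abstract does not specify how the entropy-based confidence estimate is computed or thresholded. As a hedged illustration only, a minimal PyTorch sketch of one plausible reading follows: score each reasoning step by the mean entropy of the model's next-token distributions, and stop generating further steps once that entropy falls below a confidence threshold. The function names, the use of mean token entropy, and the threshold value are all assumptions for illustration, not TSAR's actual criterion.

```python
# Hypothetical sketch of entropy-based early termination; TSAR's actual
# criterion, segmentation, and thresholds are not specified in the abstract.
import torch
import torch.nn.functional as F

def step_entropy(logits: torch.Tensor) -> float:
    """Mean token-level entropy over one reasoning step.

    logits: (seq_len, vocab_size) next-token logits for the step's tokens.
    Lower entropy indicates a more confident (peaked) predictive distribution.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)  # (seq_len,)
    return entropy.mean().item()

def should_terminate(step_logits: torch.Tensor, threshold: float = 0.5) -> bool:
    """Stop emitting further reasoning steps once predictive entropy
    drops below the (assumed) confidence threshold."""
    return step_entropy(step_logits) < threshold
```

A per-step signal like this could also feed the precision scheduler described in the abstract, e.g. lowering compute precision only when confidence is high; how TSAR coordinates the two decisions is not detailed here.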
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 6555