Keywords: Efficient Reasoning, Efficient Inference Methods, Chain of Thought (CoT), Adaptive Halting
TL;DR: LEASH is a training-free, decoding-time heuristic that watches the slope of token-level entropy and the top-logit margin for a plateau, then halts chain-of-thought generation and prompts for the final answer. No extra models or retraining are needed.
Abstract: Chain-of-Thought (CoT) prompting is a key technique for enabling complex reasoning in large language models. However, generating full, fixed-length rationales is computationally wasteful, inflating both token usage and latency. We introduce **LEASH**: **L**ogit-**E**ntropy **A**daptive **S**topping **H**euristic, a training-free decoding algorithm that adaptively halts rationale generation. **LEASH** monitors two intrinsic signals: the slope of token-level entropy and the improvement in the top-logit margin. It terminates generation once both signals plateau, indicating that the model has reached a stable reasoning state. Across four instruction-tuned models on the GSM8K and AQuA-RAT benchmarks, **LEASH** reduces average token generation by $\approx$ 30--35\% and latency by $\approx$ 27\%, at the cost of a $\approx$ 10 percentage-point accuracy drop relative to standard CoT. **LEASH** is model-agnostic and requires no additional training or supervision, offering a simple and efficient alternative to CoT decoding.
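The abstract does not spell out the exact plateau test, so below is a minimal Python sketch of the kind of criterion it describes: track per-token entropy and top-logit margin during decoding, and halt once a windowed entropy slope and the margin improvement over the same window both fall below small thresholds. The function names (`entropy`, `top_margin`, `leash_should_halt`), the window size, the thresholds, and the toy logits in the demo are illustrative assumptions, not the paper's actual parameters.

```python
import math

def entropy(logits):
    """Shannon entropy (nats) of the softmax distribution over logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return -sum((e / z) * math.log(e / z) for e in exps)

def top_margin(logits):
    """Gap between the largest and second-largest logit."""
    a, b = sorted(logits, reverse=True)[:2]
    return a - b

def leash_should_halt(entropies, margins, window=8,
                      eps_slope=0.01, eps_margin=0.01):
    """Assumed plateau test: halt once (a) the least-squares slope of
    entropy over the last `window` tokens and (b) the top-logit margin
    improvement over that window both drop below small thresholds."""
    if len(entropies) < window:
        return False
    recent_e = entropies[-window:]
    x_mean = (window - 1) / 2
    e_mean = sum(recent_e) / window
    slope = sum((x - x_mean) * (e - e_mean)
                for x, e in enumerate(recent_e))
    slope /= sum((x - x_mean) ** 2 for x in range(window))
    margin_gain = margins[-1] - margins[-window]
    return abs(slope) < eps_slope and margin_gain < eps_margin

if __name__ == "__main__":
    # Toy decoding loop: synthetic logits whose top margin saturates,
    # standing in for the per-token logits of a real language model.
    entropies, margins = [], []
    for step in range(64):
        boost = 5.0 * math.tanh(step / 10.0)  # saturating confidence
        logits = [boost] + [0.0] * 9
        entropies.append(entropy(logits))
        margins.append(top_margin(logits))
        if leash_should_halt(entropies, margins):
            print(f"halt CoT at token {step}; prompt for the final answer")
            break
```

In a real decoding loop, the logits would come from the language model at each generated token, and on halt the decoder would, per the TL;DR, stop the rationale and prompt the model for the final answer directly.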
Submission Number: 298