Pruning the Unsurprising: Efficient LLM Reasoning via First-Token Surprisal

ACL ARR 2026 January Submission3168 Authors

04 Jan 2026 (modified: 20 Mar 2026) · CC BY 4.0
Keywords: Large Language Models, Chain-of-Thought (CoT), Efficient Reasoning, CoT Compression
Abstract: Large Reasoning Models (LRMs) have demonstrated remarkable capabilities by scaling up the length of Chain-of-Thought (CoT). However, excessively long reasoning traces impose substantial training cost and inference latency. While various CoT compression approaches have emerged to address this challenge, they face inherent trade-offs: token-level methods often disrupt syntactic and logical coherence, while perplexity-based step-level methods fail to reliably identify logically critical reasoning steps because the logical signal is diluted across the many tokens of each step. In this paper, we propose **ASAP** (**A**nchor-guided, **S**urpris**A**l-based **P**runing), a novel coarse-to-fine framework for CoT compression. ASAP first performs anchor-guided pruning to preserve the core reasoning structure, which efficiently reduces the search space for subsequent processing. Leveraging the insight that logical branching choices are concentrated at the onset of reasoning steps, it then performs logic-aware pruning, selecting logically essential reasoning steps with a novel first-token surprisal metric. Finally, ASAP distills models to autonomously generate and leverage these concise CoTs at inference time, enabling efficient reasoning. Experiments show that ASAP achieves state-of-the-art accuracy across multiple benchmarks while substantially reducing training and inference costs.
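The first-token surprisal idea from the abstract can be sketched as follows. This is an illustrative reconstruction, not the authors' code: it assumes each reasoning step's token log-probabilities (conditioned on the preceding context) are already available, scores each step by the surprisal of its opening token, and keeps the most surprising steps. The function names, the `keep_ratio` parameter, and the selection rule are assumptions for illustration.

```python
def first_token_surprisal(step_logprobs):
    """Surprisal (negative log-probability, in nats) of each step's first token.

    `step_logprobs` maps each reasoning step to the model's token
    log-probabilities, conditioned on all preceding context. The intuition
    from the paper is that logical branching choices concentrate at the
    onset of a step, so a low-surprisal opening token marks a step that is
    a candidate for pruning. (Hypothetical helper, not the authors' code.)
    """
    return [-lps[0] for lps in step_logprobs]


def prune_steps(steps, step_logprobs, keep_ratio=0.5):
    """Keep the `keep_ratio` fraction of steps with the highest
    first-token surprisal, preserving the original step order.
    (`keep_ratio` is an assumed knob, not a parameter from the paper.)"""
    scores = first_token_surprisal(step_logprobs)
    k = max(1, int(len(steps) * keep_ratio))
    # Indices of the k most surprising steps, then restore document order.
    keep = set(sorted(range(len(steps)), key=lambda i: scores[i], reverse=True)[:k])
    return [s for i, s in enumerate(steps) if i in keep]
```

For example, a step opening with a predictable connective ("Therefore ...") would score low and be pruned, while one opening with an unexpected pivot ("Wait, ...") would score high and be kept.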
Paper Type: Long
Research Area: LLM Efficiency
Research Area Keywords: pruning, LLM Efficiency, chain-of-thought, efficient models
Contribution Types: NLP engineering experiment, Approaches for low compute settings-efficiency
Languages Studied: English, Python
Submission Number: 3168