Efficient Thinking via Meta Chain-of-Thought Evaluation

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: LLM reasoning, meta chain-of-thought
Abstract: Large Language Models (LLMs) have shown remarkable capabilities in handling complex tasks. Recent breakthroughs in Large Reasoning Models (LRMs), such as OpenAI’s o1 and DeepSeek-R1, have pushed performance even further in System-2 reasoning domains like mathematics and programming. By leveraging supervised fine-tuning (SFT) and reinforcement learning (RL), these models enhance Chain-of-Thought (CoT) reasoning. However, while longer CoT sequences boost accuracy, they also increase computational cost through verbose and redundant outputs, a challenge termed the "overthinking phenomenon". In this paper, we design a novel and efficient framework called Dynamic Verify Stopping in Long Reasoning (DVS-LR) to address overthinking. An early-stop verifier is trained to evaluate the Meta-CoT along multiple dimensions, such as completeness, correctness, and self-validation. DVS-LR consumes the CoT stream as it is generated and invokes the verifier at adaptive checkpoints. If the verifier's score reaches a threshold, CoT generation is terminated and the LRM outputs the final answer directly, without further thinking. Experiments on various math benchmarks show that our method achieves a 30% cut ratio while maintaining the original accuracy. Based on the observed average length of the truncated chains, we propose the "Budget Forcing Early Stop Majority Voting" method for settings with a fixed token budget. Experiments show that this method improves accuracy on various benchmarks compared with the original single-chain generation.
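The two procedures described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `verifier` is a hypothetical callable mapping a CoT prefix to a score in [0, 1], `answer_fn` is a hypothetical callable that forces a final answer from a prefix, and checkpoints are assumed to fall at paragraph breaks (the paper's actual adaptive checkpoint rule is not specified here). The voting loop is one plausible reading of "Budget Forcing Early Stop Majority Voting": early-stopped chains are sampled until a fixed token budget is spent, and their answers are then majority-voted.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional

Verifier = Callable[[str], float]              # hypothetical: CoT prefix -> score in [0, 1]
AnswerFn = Callable[[str], Optional[str]]      # hypothetical: CoT prefix -> forced final answer


@dataclass
class ChainResult:
    tokens_used: int        # reasoning tokens consumed before stopping
    stopped_early: bool     # True if the verifier score crossed the threshold
    answer: Optional[str]   # answer forced from the (possibly truncated) CoT


def early_stop_chain(
    stream: Iterable[str],
    verifier: Verifier,
    answer_fn: AnswerFn,
    threshold: float = 0.9,
    # Assumption: a checkpoint is any token ending a paragraph.
    is_checkpoint: Callable[[str], bool] = lambda tok: tok.endswith("\n\n"),
    max_tokens: Optional[int] = None,
) -> ChainResult:
    """Consume a CoT token stream, querying the verifier only at checkpoints."""
    prefix: List[str] = []
    for tok in stream:
        if max_tokens is not None and len(prefix) >= max_tokens:
            break  # budget exhausted: stop without a verifier signal
        prefix.append(tok)
        if is_checkpoint(tok) and verifier("".join(prefix)) >= threshold:
            # Verifier judges the reasoning sufficient: stop thinking and answer now.
            return ChainResult(len(prefix), True, answer_fn("".join(prefix)))
    return ChainResult(len(prefix), False, answer_fn("".join(prefix)))


def budget_majority_vote(
    make_stream: Callable[[int], Iterable[str]],  # hypothetical: sample index -> CoT stream
    verifier: Verifier,
    answer_fn: AnswerFn,
    budget: int,
    threshold: float = 0.9,
) -> Optional[str]:
    """Spend a fixed token budget on early-stopped chains, then majority-vote."""
    answers: List[str] = []
    remaining = budget
    i = 0
    while remaining > 0:
        res = early_stop_chain(make_stream(i), verifier, answer_fn,
                               threshold, max_tokens=remaining)
        if res.tokens_used == 0:
            break  # empty stream; avoid looping forever
        remaining -= res.tokens_used
        if res.answer is not None:
            answers.append(res.answer)
        i += 1
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]
```

The design choice here is that every chain draws from one shared budget, so the tokens saved by stopping early are reinvested in additional samples for the vote rather than in a single longer chain.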
Primary Area: foundation or frontier models, including LLMs
Submission Number: 11861