One-Token Verification for Reasoning LLMs, Anytime, Anywhere

ICLR 2026 Conference Submission 11214 Authors

18 Sept 2025 (modified: 26 Nov 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: LLM, parallel thinking, LoRA, test-time scaling
Abstract: Reasoning large language models (LLMs) have recently achieved breakthrough performance on complex tasks such as mathematical problem solving. A widely used strategy to further improve performance is parallel thinking, wherein multiple reasoning traces are generated, and the final prediction is chosen using methods such as Best-of-$N$ or majority voting. However, two limitations remain: most existing methods lack effective mechanisms for assessing the quality of reasoning traces, typically relying only on final logits or answers, and multi-sample decoding incurs substantial inference latency for long outputs. To address these challenges, we propose **One-Token Verification** (OTV), a lightweight framework for assessing reasoning quality via a single-token forward pass during generation. OTV introduces a learnable LoRA-based role vector that, without interfering with the primary reasoning process, enables the LLM to assume a verification role and probe the past KV cache for correctness confidence estimation. Unlike generic verifiers or external reward models, OTV is trained natively for each LLM, directly leveraging internal computations for token-level scoring. As a result, OTV can provide confidence signals at any point in generation and at any token position, realizing "anytime, anywhere" verification. Experiments on math benchmarks show that OTV consistently outperforms state-of-the-art baselines in parallel thinking. Moreover, building on OTV, we introduce efficient variants that terminate most traces early and retain only a single complete reasoning path, reducing token usage by up to 90%. In this setting, OTV maintains superior performance while favoring shorter and more reliable solutions.
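
To make the described mechanism concrete, below is a minimal sketch of how a single-token verification step of this kind could look, assuming a Hugging Face Transformers + PEFT stack. The model name, the adapter path `path/to/otv-verifier-lora`, the probe token, and the Yes/No token pair used as the confidence readout are illustrative assumptions, not the authors' actual implementation or training recipe.

```python
# Illustrative sketch only: adapter path, probe token, and Yes/No readout are
# assumptions; the paper's actual OTV training and scoring may differ.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"  # any reasoning LLM
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(model, "path/to/otv-verifier-lora")  # hypothetical verifier adapter
model.eval()

# Assumed readout: probability mass on "Yes" vs. "No" after a one-token probe.
YES_ID = tokenizer.encode("Yes", add_special_tokens=False)[0]
NO_ID = tokenizer.encode("No", add_special_tokens=False)[0]
PROBE_ID = tokenizer.encode(":", add_special_tokens=False)[0]  # placeholder probe token

@torch.no_grad()
def reasoning_step(input_ids, past_key_values):
    """Ordinary decoding step: the verifier LoRA is disabled, so the base
    model's reasoning trace (and its KV cache) is left untouched."""
    with model.disable_adapter():
        out = model(input_ids=input_ids, past_key_values=past_key_values, use_cache=True)
    return out.logits[:, -1, :], out.past_key_values

@torch.no_grad()
def one_token_verify(past_key_values):
    """Single-token forward pass with the verifier LoRA active, probing the
    existing KV cache for a correctness-confidence score in [0, 1]."""
    # Copy the cache so the probe token does not leak into later generation;
    # a real implementation would avoid this copy (e.g., by cropping the cache).
    probe = torch.tensor([[PROBE_ID]], device=model.device)
    out = model(input_ids=probe, past_key_values=copy.deepcopy(past_key_values))
    yes_no = out.logits[:, -1, [YES_ID, NO_ID]]
    return torch.softmax(yes_no, dim=-1)[0, 0].item()
```

Under these assumptions, a parallel-thinking controller could call `one_token_verify` every few decoding steps per trace and prune or early-stop traces whose confidence stays low, which is the usage pattern the abstract's efficient variants suggest.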
Primary Area: foundation or frontier models, including LLMs
Submission Number: 11214