Finding the Cracks: Improving LLM Reasoning with Paraphrastic Probing and Consistency Verification

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: large language model, reasoning, critical token
Abstract: Large language models (LLMs) have demonstrated impressive performance across a variety of reasoning tasks in domains such as mathematics, coding, and planning, particularly when guided by chain-of-thought prompting to elicit intermediate reasoning steps. However, their problem-solving ability often declines on more complex tasks due to hallucinations and the accumulation of errors within these intermediate steps. Recent work has introduced the notion of critical tokens (tokens in the reasoning process that exert significant influence on subsequent steps). Prior empirical studies suggest that replacing critical tokens can refine reasoning trajectories and lead to correct answers. Nonetheless, reliably identifying and exploiting critical tokens to enhance LLM reasoning remains challenging. To address this, we propose the Paraphrastic Probing and Consistency Verification (PPCV) framework, which leverages critical tokens to improve reasoning performance. PPCV operates in two stages. In the first stage, we roll out an initial reasoning path from the original question and then concatenate paraphrased versions of the question with this reasoning path. Feeding these inputs into the LLM yields token-level logits, from which we identify candidate critical tokens at positions where the predicted top-1 token disagrees with the token actually present in the reasoning path; a confirmation criterion is then applied to determine the final critical tokens. In the second stage, we replace critical tokens with candidate alternatives and roll out new reasoning paths for both the original and paraphrased questions. The final answer is determined by checking the consistency of outputs across these parallel reasoning processes. We evaluate PPCV on mainstream LLMs, including Llama-3.1-8B-Instruct, Mistral-7B-Instruct-v0.2, and Qwen3-32B, across multiple benchmarks covering mathematics and logical reasoning. Extensive experiments demonstrate that PPCV substantially enhances the reasoning performance of LLMs compared to baseline methods.
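
To make the pipeline concrete, the following is a minimal, hypothetical Python sketch of the two stages as described in the abstract. It assumes a Hugging Face causal LM (Llama-3.1-8B-Instruct is one of the evaluated models); the function names `top1_mismatches` and `consistent_answer`, the raw top-1 mismatch test, and the majority-vote rule are illustrative assumptions, since the paper's actual confirmation criterion and token-substitution strategy are not specified in the abstract.

```python
# Hypothetical sketch of PPCV's two stages, assuming a Hugging Face causal LM.
# Stage 1: flag positions in a fixed reasoning path where the model's top-1
# prediction (conditioned on a paraphrased question) disagrees with the token
# actually present in the path -- these are candidate critical tokens.
from collections import Counter

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # one of the evaluated models
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

def top1_mismatches(question: str, reasoning_path: str):
    """Return (position, expected_token) pairs inside the reasoning path
    where the model's top-1 prediction differs from the expected token."""
    q_ids = tok(question, return_tensors="pt").input_ids
    r_ids = tok(reasoning_path, add_special_tokens=False, return_tensors="pt").input_ids
    ids = torch.cat([q_ids, r_ids], dim=-1)
    with torch.no_grad():
        logits = model(ids).logits  # (1, seq_len, vocab_size)
    preds = logits[0, :-1].argmax(dim=-1)  # logits at step t predict token t+1
    q_len = q_ids.shape[-1]
    mismatches = []
    for t in range(q_len - 1, ids.shape[-1] - 1):  # predictions over the path
        expected = ids[0, t + 1].item()
        if preds[t].item() != expected:
            mismatches.append((t + 1 - q_len, tok.decode([expected])))
    return mismatches

# Stage 2 (consistency verification), sketched here as a simple majority vote
# over the final answers of the parallel rollouts (original plus paraphrased
# questions with critical tokens replaced); the paper's exact rule may differ.
def consistent_answer(answers):
    answer, count = Counter(answers).most_common(1)[0]
    return answer if count > len(answers) / 2 else None
```

In this sketch, tokens flagged consistently across several paraphrases would be passed to the confirmation step, replaced with high-probability alternatives drawn from the same logits, and the resulting rollouts checked by `consistent_answer`.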
Primary Area: foundation or frontier models, including LLMs
Submission Number: 10015