Keywords: RLVR, Test-Time Scaling, Reasoning with LLMs
TL;DR: Synthesize high-quality critique data and train a verifier with RLVR to improve test-time scaling.
Abstract: Test-time scaling via solution sampling and aggregation has become a key paradigm for improving the reasoning performance of Large Language Models (LLMs). While reward model selection is commonly employed in this approach, it often fails to identify minority-yet-correct answers, which limits its effectiveness beyond that of simple majority voting. We argue that this limitation stems from a lack of informative critique signals during verifier training. To bridge this gap, we introduce \textbf{Mirror-Critique}, a framework that trains a verifier with informative critiques. Our key insight is to exploit the rich critique signal obtained by contrasting model-generated solutions with ground-truth solutions. We deploy a small instruction-tuned model with rejection sampling to synthesize high-quality critique data that teaches the verifier not only what is wrong, but also why. The synthetic data is then used to cold-start the LLM before the RLVR process, further improving its verification ability. The resulting \textbf{Mirror-Verifier} evaluates candidate solutions by generating multiple critiques per solution and aggregating them into a verification score used for weighted voting or selective abstention.
Experimental results show that our \textbf{Mirror-Verifier} significantly outperforms majority voting in solution accuracy and also improves the solver's honesty, enabling it to recognize and abstain from questions beyond its capability boundary.
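To make the aggregation step concrete, here is a minimal sketch of verification-score-weighted voting with selective abstention. It assumes each critique yields a binary verdict (1 = solution judged correct, 0 = judged incorrect); the function names, the score definition, and the abstention threshold are illustrative assumptions, not the paper's exact procedure.

```python
from collections import defaultdict

def aggregate_verdicts(critiques):
    """Turn a list of binary critique verdicts into a verification score in [0, 1]."""
    return sum(critiques) / len(critiques) if critiques else 0.0

def weighted_vote(candidates, abstain_threshold=0.3):
    """candidates: list of (answer, critique_verdicts) pairs.
    Sums verification scores per distinct answer (weighted voting) and
    abstains when the winning answer's relative support is too weak."""
    support = defaultdict(float)
    for answer, verdicts in candidates:
        support[answer] += aggregate_verdicts(verdicts)

    best_answer, best_score = max(support.items(), key=lambda kv: kv[1])
    total = sum(support.values())
    # Selective abstention: refuse to answer when even the best answer
    # receives only weak support from the verifier's critiques.
    if total == 0 or best_score / total < abstain_threshold:
        return None  # abstain
    return best_answer

# Toy usage: three sampled solutions, two agreeing on "42".
candidates = [
    ("42", [1, 1, 0]),  # critiques mostly approve
    ("42", [1, 0, 1]),
    ("17", [0, 0, 1]),  # minority answer with weak support
]
print(weighted_vote(candidates))  # -> "42"
```

Weighting each answer by its verification score (rather than counting raw votes) is what lets a minority-yet-correct answer win when the verifier's critiques favor it.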
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 7561