Keywords: LLM, Verification, Tool-integrated reasoning
Abstract: Answer verification methods are widely employed in language model training pipelines spanning data curation, evaluation, and reinforcement learning with verifiable rewards (RLVR).
While prior work focuses on developing unified verifiers applicable across multiple reasoning scenarios, significant challenges remain in computation-oriented scientific domains, such as algebraic equivalence checking.
In this paper, we introduce CosineVerifier, a tool-augmented verifier that leverages external executors to perform precise computations and symbolic simplifications.
CosineVerifier enables robust verification that goes beyond simple semantic matching.
To train this accurate tool-augmented verifier, we propose a novel data-augmentation method for verifier training data, together with a two-stage training framework that improves the correctness of tool-invoked verifications on computation-heavy questions.
Extensive experiments across STEM, QA, and long-form reasoning tasks demonstrate CosineVerifier's robust generalization, achieving state-of-the-art performance on VerifyBench-Hard and SCI-Bench. Furthermore, when employed as an RLVR reward model, CosineVerifier consistently outperforms both rubric- and model-based verifiers on AIME'24, AIME'25, and GPQA-D, highlighting its potential to advance LLM reasoning.
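The abstract does not specify how CosineVerifier's external executor performs algebraic equivalence checking. As a minimal, hypothetical sketch of what such a tool call could look like (not the paper's implementation), the snippet below checks whether two candidate answers are algebraically equivalent by evaluating both expressions at random points; the function name and interface are illustrative assumptions.

```python
import random

def numerically_equivalent(expr_a, expr_b, var="x", trials=20, tol=1e-9):
    """Probabilistic equivalence check between two single-variable
    expressions: sample random points and compare the evaluated values.
    (Illustrative sketch only; a real executor might instead call a
    computer-algebra system to simplify the difference symbolically.)"""
    for _ in range(trials):
        point = {var: random.uniform(-10.0, 10.0)}
        try:
            # Restrict builtins so only the sampled variable is visible.
            a = eval(expr_a, {"__builtins__": {}}, point)
            b = eval(expr_b, {"__builtins__": {}}, point)
        except (ZeroDivisionError, ValueError, OverflowError):
            continue  # skip points outside the expressions' common domain
        if abs(a - b) > tol * max(1.0, abs(a), abs(b)):
            return False  # a single mismatch proves non-equivalence
    return True

print(numerically_equivalent("(x + 1)**2", "x**2 + 2*x + 1"))  # True
print(numerically_equivalent("(x + 1)**2", "x**2 + 1"))        # False
```

This goes beyond string or semantic matching: `(x + 1)**2` and `x**2 + 2*x + 1` differ textually but are judged equivalent, which is the kind of robustness the abstract attributes to execution-backed verification.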
Paper Type: Long
Research Area: Natural Language Generation
Research Area Keywords: generation, automatic evaluation, applications, chain-of-thought
Contribution Types: Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 9694