UnifiedVerifier: Unifying Paradigms in Automated LLM Evaluation

20 Sept 2025 (modified: 12 Feb 2026) · ICLR 2026 Conference Desk Rejected Submission · CC BY 4.0
Keywords: LLM, LLM Evaluation, LLM-as-a-Judge, Customizable Evaluation
TL;DR: We introduce UnifiedVerifier, a unified LLM evaluation framework that uses innovative data generation and training methods to achieve comprehensive, customizable evaluation in a single model, outperforming much larger models with greater efficiency.
Abstract: The current landscape of Large Language Model (LLM) evaluation is fragmented, with bespoke models for objective verification (e.g., answer and process verification, fact-checking) and subjective judgment (e.g., response quality ranking) operating in isolation. These models are often trained under specific task paradigms or fixed prompts, so they lack versatility and cannot accommodate user needs for customizable evaluation criteria, input forms, and output formats. To address these challenges, this paper introduces UnifiedVerifier, a framework designed to achieve comprehensive, general-purpose, and customizable verification capabilities within a single model. The core contributions of UnifiedVerifier are twofold. First, we present Evolutionary Verification Data Synthesis (Evo-Verify), a multi-stage, evolution-inspired automated pipeline that systematically generates a large-scale, high-fidelity training dataset. This dataset spans a wide range of verification dimensions, intricate judgment criteria, and varied output formats, which supports the model's versatility. Second, we propose an alignment technique called "Core-Anchored Reinforcement Learning" (CARL), which mitigates reward hacking in conventional reinforcement learning by anchoring a majority of the reward signal to verifiable, objective ground truths, yielding more robust and reliable model alignment. Experimental results show that UnifiedVerifier, trained on a 4-billion-parameter model, not only surpasses its base model across a suite of benchmarks covering both objective and subjective tasks but also outperforms larger thinking models on key objective and subjective verification tasks, at only one-tenth the inference cost of the base thinking model. This demonstrates that the UnifiedVerifier framework achieves a strong balance between generality, performance, and efficiency, offering a new paradigm for building the next generation of LLM evaluation tools.
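The abstract does not give CARL's reward formula, so the following is only an illustrative sketch of what "anchoring a majority of the reward signal to verifiable, objective ground truths" could look like. The names `Verdict`, `core_anchored_reward`, and `core_weight`, the 0.8 default, and the way the auxiliary term is composed are all assumptions made for this example, not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    """Hypothetical container for one verifier output (not from the paper)."""
    label: str          # predicted verdict, e.g. "correct" or "incorrect"
    format_ok: bool     # whether the output followed the requested format
    style_score: float  # assumed subjective quality score in [0, 1]


def core_anchored_reward(verdict: Verdict, gold_label: str,
                         core_weight: float = 0.8) -> float:
    """Mix a verifiable 'core' reward with a softer auxiliary reward.

    Assumption: 'majority' is read as core_weight > 0.5, so the auxiliary
    term alone can never outweigh a wrong verdict.
    """
    if not 0.5 < core_weight <= 1.0:
        raise ValueError("core_weight must be in (0.5, 1.0]")

    # Core term: checked against an objective ground-truth label.
    core = 1.0 if verdict.label == gold_label else 0.0

    # Auxiliary term: format adherence plus a clipped subjective score.
    style = min(max(verdict.style_score, 0.0), 1.0)
    aux = 0.5 * float(verdict.format_ok) + 0.5 * style

    return core_weight * core + (1.0 - core_weight) * aux


if __name__ == "__main__":
    right_but_plain = Verdict(label="correct", format_ok=False, style_score=0.0)
    wrong_but_polished = Verdict(label="incorrect", format_ok=True, style_score=1.0)
    print(core_anchored_reward(right_but_plain, "correct"))
    print(core_anchored_reward(wrong_but_polished, "correct"))
```

Under this reading, a polished but wrong verdict always scores below a plain but correct one, which is the property that would discourage reward hacking through the subjective component.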
Primary Area: foundation or frontier models, including LLMs
Submission Number: 23477