<div align="center">
     <h1>VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models</h1>
</div>

## Overview 🦾🦾
In this paper, we present VerifyBench, a benchmark specifically designed to evaluate the accuracy of reference-based reward systems. To create VerifyBench, we curated a diverse collection of instructions paired with reference answers sourced from existing open datasets. Responses to these instructions were generated by multiple open-source and proprietary LLMs. The correctness of each response was assessed using both automated model judgments and human evaluations. Each instance in VerifyBench was verified by at least two human annotators to ensure label consistency and reliability, thereby producing a high-quality benchmark for the evaluation of reward systems.
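As a rough illustration of what "accuracy of a reference-based reward system" means here, the sketch below scores a verifier by how often its verdict matches the human label. The `Instance` field names, the `verifier_accuracy` helper, and the `exact_match` baseline are all hypothetical and are not taken from `evaluate.py` or the released data format.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Instance:
    """One benchmark record. Field names are assumptions for illustration."""
    question: str
    reference: str   # reference answer for the question
    response: str    # model-generated response to be judged
    label: bool      # human-verified correctness of `response`


# A verifier maps (question, reference, response) to a correct/incorrect verdict.
Verifier = Callable[[str, str, str], bool]


def verifier_accuracy(instances: Iterable[Instance], verify: Verifier) -> float:
    """Fraction of instances where the verifier agrees with the human label."""
    items = list(instances)
    if not items:
        raise ValueError("no instances to evaluate")
    hits = sum(
        verify(x.question, x.reference, x.response) == x.label for x in items
    )
    return hits / len(items)


def exact_match(question: str, reference: str, response: str) -> bool:
    """Toy rule-based verifier: correct only if the strings match exactly."""
    return response.strip() == reference.strip()


if __name__ == "__main__":
    demo = [
        Instance("1 + 1 = ?", "2", "2", True),
        Instance("1 / 2 = ?", "0.5", "1/2", True),   # equivalent, but not identical
        Instance("2 + 2 = ?", "4", "5", False),
    ]
    print(f"accuracy: {verifier_accuracy(demo, exact_match):.3f}")
```

The second toy record shows why naive string matching falls short: the response is equivalent to the reference but written differently, so the rule-based verifier disagrees with the human label.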

To better discriminate among verification techniques and to stress-test current systems, we further developed VerifyBench-Hard, a more challenging variant of our benchmark. This dataset focuses on contentious cases where leading models produce highly conflicting judgments, providing a more stringent test of reward system accuracy. VerifyBench-Hard samples were carefully selected based on disagreement patterns among high-performing models, then thoroughly annotated by humans to ensure label quality.
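One way to picture disagreement-based selection is sketched below. The exact selection criterion is not specified in this README, so the minority-fraction score, the `0.4` threshold, and the function names are assumptions chosen only for illustration.

```python
from collections import Counter


def disagreement(judgments: list[bool]) -> float:
    """Fraction of judges in the minority: 0.0 is unanimous, 0.5 is an even split."""
    if not judgments:
        raise ValueError("need at least one judgment")
    counts = Counter(judgments)
    # Counter returns 0 for a missing key, so unanimous votes yield a minority of 0.
    minority = min(counts[True], counts[False])
    return minority / len(judgments)


def select_contentious(pool: dict[str, list[bool]], threshold: float = 0.4) -> list[str]:
    """Keep the IDs whose model judgments disagree at least `threshold` (assumed value)."""
    return [
        item_id for item_id, votes in pool.items()
        if disagreement(votes) >= threshold
    ]


if __name__ == "__main__":
    # Hypothetical correct/incorrect verdicts from four judge models per sample.
    pool = {
        "sample-a": [True, True, True, True],     # unanimous
        "sample-b": [True, False, True, False],   # evenly split
        "sample-c": [True, True, True, False],    # one dissenter
    }
    print(select_contentious(pool))
```

In this sketch, samples that pass the filter would still go to human annotators for the final label; the model votes only decide which cases are worth annotating.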

Our contributions can be summarized as follows:  
- To better reflect realistic reinforcement learning (RL) scenarios for reasoning models, we construct VerifyBench, a benchmark derived from existing models and datasets, to provide an objective evaluation of the accuracy of reference-based reward systems.
- We further develop VerifyBench-Hard, a more challenging benchmark curated from cases exhibiting high disagreement among multiple models. This dataset contains a larger proportion of difficult-to-verify samples, highlighting substantial potential for improvement in current models.
- We conduct a comprehensive empirical analysis of model performance on both VerifyBench and VerifyBench-Hard, offering actionable insights to advance the accuracy of reference-based reward systems and enhance RL training in reasoning tasks.

## Try VerifyBench!
Run `evaluate.py` to test your own models on VerifyBench and VerifyBench-Hard.
```bash
# for VerifyBench
python3 evaluate.py --model_name_or_path <your_model_path>

# for VerifyBench-Hard
python3 evaluate.py --model_name_or_path <your_model_path> --hard

# for No-Reference scenario
python3 evaluate.py --model_name_or_path <your_model_path> --wo-ref
```
