# Judge Benchmark Pipeline

This repository provides a pipeline for generating, merging, and processing model outputs on evaluation datasets. It is designed to run multiple models in parallel across GPUs, merge their results, and process the outputs for analysis.

---

## 📂 Project Structure

- **`preprocess_dataset.ipynb`**  
  Jupyter notebook for preparing and preprocessing datasets before running evaluations.

- **`gen_judge.sh`**  
  Shell script to automate the end-to-end pipeline:

  - Runs generation with `gen_vllm_judge.py` across GPUs.
  - Merges outputs with `merge.py`.
  - Processes results with `process_result.py`.

- **`gen_vllm_judge.py`**  
  Script for generating model outputs on benchmark datasets using vLLM.

- **`merge.py`**  
  Merges distributed generation results into a single JSON file.

- **`process_result.py`**  
  Processes merged results into final evaluation-ready format.

- **`load_prompt.py`**  
  Helper script for loading prompts/templates used in generation.

- **`load_judge_dataset.py`**  
  Utility script for loading benchmark datasets such as **rm_bench** and **judgebench**.
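
The merge step can be pictured with the minimal sketch below. It is not the actual contents of `merge.py`: the `shard_*.json` file naming, the `merge_shards` helper, and the assumption that each per-GPU shard holds a JSON list are all hypothetical and chosen only for illustration.

```python
import json
from pathlib import Path


def merge_shards(shard_dir, output_name="generations.json", pattern="shard_*.json"):
    """Concatenate per-GPU JSON list files into one JSON file.

    The shard naming pattern is an assumption, not the repository's convention.
    """
    shard_dir = Path(shard_dir)
    merged = []
    # Sort the shard paths so the merged order is deterministic across runs.
    for shard_path in sorted(shard_dir.glob(pattern)):
        with shard_path.open(encoding="utf-8") as f:
            records = json.load(f)
        if not isinstance(records, list):
            raise ValueError(f"{shard_path} does not contain a JSON list")
        merged.extend(records)
    output_path = shard_dir / output_name
    with output_path.open("w", encoding="utf-8") as f:
        json.dump(merged, f, ensure_ascii=False, indent=2)
    return output_path
```

Sorting the shard paths keeps the merged file stable between runs, which makes diffs between experiments easier to read.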

---

## ⚙️ Usage

### 1. Preprocess Dataset

Prepare the datasets using the provided notebook, `preprocess_dataset.ipynb`.

### 2. Run Judge Script

Update the `BASE_PATH` in `load_judge_dataset.py` to point to your dataset location.

Then execute the full pipeline:

```bash
bash gen_judge.sh
```

Results are saved to `result/<model_name>_<mode>/<dataset_name>/generations.json`.
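
For downstream analysis, the output path pattern above can be built programmatically. The sketch below is illustrative only: `generations_path` and `load_generations` are hypothetical helpers that are not part of this repository, and it assumes `generations.json` is plain JSON that `json.load` can read.

```python
import json
from pathlib import Path


def generations_path(model_name, mode, dataset_name, root="result"):
    """Build result/<model_name>_<mode>/<dataset_name>/generations.json."""
    return Path(root) / f"{model_name}_{mode}" / dataset_name / "generations.json"


def load_generations(model_name, mode, dataset_name, root="result"):
    """Load the merged generations for one model/mode/dataset combination."""
    path = generations_path(model_name, mode, dataset_name, root)
    with path.open(encoding="utf-8") as f:
        return json.load(f)
```

The model name, mode, and dataset name passed in must match whatever values `gen_judge.sh` was run with.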

Generations from **Qwen3** are available for download here:  
[Google Drive link](https://drive.google.com/drive/folders/1wRejsd4qa72SqNxJMzwUd43gb0ZgUm0Y?usp=sh

The current setup uses **vLLM==0.10.0**, but feel free to adjust the version as needed to ensure compatibility with different models.




# JSON Data
`chatbot_arena.json` – Preference data from the `GAIR/preference-dissection` dataset, containing human preference comparisons of chatbot responses.

`PPE_HF.json` – Human preference evaluations from the `PPE-Human-Preference-V1` dataset, with ties removed.

`reward_bench_v2.json` – Data from `Reward-Bench v2`, with ties removed.

`judgebench.json` – Combined GPT and Claude subsets from `JudgeBench`, which evaluates LLM-as-a-Judge on challenging tasks (knowledge, reasoning, math, coding).

`helpsteer3.json` – Processed data from `HelpSteer3` (preference split), filtered to remove empty contexts and neutral (no-preference) examples, with a `split` field marking train/test.

`magpie_pro.json` – A subset of `Skywork` Reward Preference 80K, filtered to include only data from the `magpie_pro_llama3.1` source.

`mixture.json` – A mixture dataset combining math preference pairs (from `Math-Step-DPO-10K`) and code preference pairs (from `Code-Preference-Pairs`), with standardized fields: `prompt`, `chosen`, and `rejected`.
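
As a rough illustration of tie removal and the standardized `prompt` / `chosen` / `rejected` layout, the sketch below converts raw pairwise comparisons into preference pairs. The input field names (`response_a`, `response_b`, `winner`) and both helper functions are assumptions made for this example; the actual preprocessing lives in `preprocess_dataset.ipynb` and may differ.

```python
def to_preference_pair(prompt, response_a, response_b, winner):
    """Return a standardized pair, or None when there is no clear winner.

    `winner` is assumed to be "a", "b", or anything else for a tie.
    """
    if winner == "a":
        chosen, rejected = response_a, response_b
    elif winner == "b":
        chosen, rejected = response_b, response_a
    else:
        return None  # ties and neutral labels are dropped
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


def build_pairs(rows):
    """Convert raw comparison rows into standardized pairs, skipping ties."""
    pairs = []
    for row in rows:
        pair = to_preference_pair(
            row["prompt"], row["response_a"], row["response_b"], row["winner"]
        )
        if pair is not None:
            pairs.append(pair)
    return pairs
```

Keeping every dataset in the same three-field shape means the same loading and judging code can be reused across benchmarks.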