# 🚀 RocketEval: Efficient Automated LLM Evaluation via Grading Checklist

## Intro

This is the official repository for the paper "RocketEval: Efficient Automated LLM Evaluation via Grading Checklist".

## Quick Start

To run the evaluation on the example `mt-bench` benchmark dataset with the selected test models, run `src/run.py`:
```shell
# vLLM (Batch mode)
python src/run.py --dataset mt-bench --generator google/Gemma-2-27B-it --judge google/Gemma-2-2B-it

# --generator: The generator model used in checklist generation.
# --judge: The judge model used in checklist grading.

# If using an API, configure your OpenAI-compatible client first
export OPENAI_API_KEY="<API_KEY>"
export OPENAI_BASE_URL="<URL>"

# API (Batch mode)
python src/run.py --dataset mt-bench --generator gpt-4o --judge gpt-4o-mini --openai_client
# API (Instant mode, recommended only for APIs that do not support batch mode)
python src/run.py --dataset mt-bench --generator gpt-4o --judge gpt-4o-mini --openai_client --instant_api
```

## Preparing Data
We provide 4 example public benchmark datasets in the `data` folder. You can also use your own data by preparing the following types of files. All files should be stored in JSON Lines (`.jsonl`) format. The data format mostly follows [WildBench](https://huggingface.co/datasets/allenai/WildBench) to ensure compatibility with other evaluation tools.

### 📝 Queries

```json
{
    "session_id": "<Identifier of the query in RocketEval>",
    "conversation_input":[
        {"content": "<Historical user query, used as context>", "role":"user"},
        {"content": "<Historical system response, used as context>", "role":"assistant"},
        {"content": "<Current user query>", "role":"user"}
    ],
    "checklist":[],
    "references":{
        "gpt-4": "<Reference response>"
    }
}
```
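
As a rough illustration of the query format above, the following sketch builds one record and serializes it as a single JSON Lines row. The helper name `make_query` and the sample texts are hypothetical and not part of RocketEval; the field names come from the example above.

```python
import json

def make_query(session_id, turns, reference=None, reference_model="gpt-4"):
    """Build one query record (hypothetical helper, not part of RocketEval).

    `turns` alternates user and assistant text and must end with the
    current user query, so its length is odd.
    """
    if len(turns) % 2 == 0:
        raise ValueError("turns must end with a user query (odd length)")
    roles = ["user", "assistant"]
    record = {
        "session_id": session_id,
        "conversation_input": [
            {"content": text, "role": roles[i % 2]} for i, text in enumerate(turns)
        ],
        "checklist": [],
        "references": {},
    }
    if reference is not None:
        record["references"][reference_model] = reference
    return record

query = make_query(
    "q-001",
    ["Hi", "Hello! How can I help?", "Explain recursion."],
    reference="Recursion is ...",
)
# In a .jsonl file each record occupies exactly one line.
line = json.dumps(query, ensure_ascii=False)
```

Note that the pretty-printed example above spans several lines for readability only; in `queries.jsonl` each record must be written on one line.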

### 📝 Responses

```json
{
    "session_id":"<Identifier of the query in RocketEval>",
    "chat_history":[
        "<Historical user query, used as context>",
        "<Historical system response, used as context>",
        "<Current user query>"
    ],
    "output":["<Response to current user query>"],
    "generator":"<Name of generator model>"
}
```

> Fields that exist in [WildBench](https://huggingface.co/datasets/allenai/WildBench) but are not listed here are not used by RocketEval.
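
The response format mirrors the query format: `chat_history` carries the same turns as `conversation_input`, flattened to plain strings. A minimal sketch of deriving a response record from a query record, assuming that correspondence holds (`make_response` and the sample values are hypothetical):

```python
import json

def make_response(query, output_text, generator):
    """Build a response record for `query` (hypothetical helper).

    Assumption: chat_history is the list of `content` strings from the
    query's conversation_input, in the same order.
    """
    return {
        "session_id": query["session_id"],
        "chat_history": [turn["content"] for turn in query["conversation_input"]],
        "output": [output_text],
        "generator": generator,
    }

example_query = {
    "session_id": "q-001",
    "conversation_input": [
        {"content": "Hi", "role": "user"},
        {"content": "Hello! How can I help?", "role": "assistant"},
        {"content": "Explain recursion.", "role": "user"},
    ],
}
response = make_response(example_query, "Recursion is when ...", "my-model")
response_line = json.dumps(response, ensure_ascii=False)  # one line of <MODEL_NAME>.jsonl
```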

Place the files in the `data` folder using the following structure:

```
data
├── <DATASET_NAME>
│   ├── queries.jsonl
│   └── response
│       ├── <MODEL_NAME_1>.jsonl
│       └── <MODEL_NAME_2>.jsonl
```
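
The layout above could be produced with a small script along these lines. This is only a sketch: `write_jsonl` and `write_dataset` are hypothetical helpers, and the example writes to a temporary directory rather than the repository's `data` folder.

```python
import json
import tempfile
from pathlib import Path

def write_jsonl(path, records):
    """Write an iterable of dicts to `path`, one JSON object per line."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

def write_dataset(data_root, dataset_name, queries, responses_by_model):
    """Create <data_root>/<dataset_name>/queries.jsonl and response/<model>.jsonl."""
    dataset_dir = Path(data_root) / dataset_name
    write_jsonl(dataset_dir / "queries.jsonl", queries)
    for model_name, responses in responses_by_model.items():
        # A model name containing "/" would create a nested folder here.
        write_jsonl(dataset_dir / "response" / f"{model_name}.jsonl", responses)
    return dataset_dir

root = tempfile.mkdtemp()  # stand-in for the repository's data folder
dataset_dir = write_dataset(
    root,
    "my-dataset",
    queries=[{"session_id": "q-001", "conversation_input": [], "checklist": [], "references": {}}],
    responses_by_model={
        "model-a": [{"session_id": "q-001", "chat_history": [], "output": ["..."], "generator": "model-a"}],
        "model-b": [{"session_id": "q-001", "chat_history": [], "output": ["..."], "generator": "model-b"}],
    },
)
```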

By default, RocketEval automatically loads and evaluates all test models stored in the dataset folder. To evaluate a specific list of models, add `<DATASET_NAME>_train.json` and `<DATASET_NAME>_test.json` to the `config/rankings` folder. These files list the model names to include in the training and test sets, respectively. Each element in a file should have the following form:
```json
{
    "name": "<MODEL_NAME>",
    "rating": "<RATING OF MODEL, USED AS THE GROUNDTRUTH RANK (OPTIONAL)>"
}
```
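
A split file could be generated as sketched below. Two points are assumptions rather than facts from this README: that the top level of each file is a JSON array of such elements, and that a numeric rating is accepted (the placeholder above does not fix the type). `write_split` is a hypothetical helper, and the example writes to a temporary directory instead of the real `config` folder.

```python
import json
import tempfile
from pathlib import Path

def write_split(config_dir, dataset_name, split, models):
    """Write <config_dir>/rankings/<dataset_name>_<split>.json.

    `models` is a list of (name, rating) pairs; a rating of None is omitted.
    Assumption: the file holds a JSON array of {"name", "rating"} objects.
    """
    if split not in ("train", "test"):
        raise ValueError("split must be 'train' or 'test'")
    entries = []
    for name, rating in models:
        entry = {"name": name}
        if rating is not None:
            entry["rating"] = rating
        entries.append(entry)
    path = Path(config_dir) / "rankings" / f"{dataset_name}_{split}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(entries, indent=4), encoding="utf-8")
    return path

config_dir = tempfile.mkdtemp()  # stand-in for the repository's config folder
test_path = write_split(config_dir, "my-dataset", "test", [("model-a", 1200), ("model-b", None)])
```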

## Running Evaluation Step-by-Step

Instead of running the whole evaluation with one command, you can run it step by step with `src/run_task.py`:

```shell
DATASET=mt-bench
GENERATOR=google/Gemma-2-27B-it
JUDGE=google/Gemma-2-2B-it
LABEL_JUDGE=gpt-4o

python src/run_task.py checklist --dataset ${DATASET} --generator ${GENERATOR}
python src/run_task.py judgment --dataset ${DATASET} --judge ${JUDGE}
python src/run_task.py score --dataset ${DATASET} --judge ${JUDGE} --label_judge ${LABEL_JUDGE}
python src/run_task.py ranking --dataset ${DATASET} --judge ${JUDGE}
```

### Checklist Generation

You can generate the checklist with the `checklist` option, which outputs the checklist for the test set. Alternatively, you can supply an already-created checklist as a JSON Lines file. The format of each item is as follows:

```json
{
    "session_id": "<Identifier of the query in RocketEval>",
    "checklist":[
        "<Checklist item 1>",
        "<Checklist item 2>",
        "<Checklist item 3>"
    ]
}
```
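
Because checklist items and queries share a `session_id`, an externally created checklist can be matched to its query by that key. The sketch below shows one way to do the join in plain Python; it is not RocketEval's own import routine, and `attach_checklists` is a hypothetical helper.

```python
import json

def attach_checklists(query_lines, checklist_lines):
    """Join checklist records onto query records by session_id.

    Both arguments are iterables of JSON Lines strings. Queries without a
    matching checklist keep whatever `checklist` value they already had.
    """
    checklists = {}
    for line in checklist_lines:
        if line.strip():
            item = json.loads(line)
            checklists[item["session_id"]] = item["checklist"]
    merged = []
    for line in query_lines:
        if line.strip():
            query = json.loads(line)
            query["checklist"] = checklists.get(
                query["session_id"], query.get("checklist", [])
            )
            merged.append(query)
    return merged

queries = [
    '{"session_id": "q-001", "checklist": []}',
    '{"session_id": "q-002", "checklist": []}',
]
checklist_rows = [
    '{"session_id": "q-001", "checklist": ["Does the answer define recursion?"]}',
]
merged = attach_checklists(queries, checklist_rows)
```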

### Checklist Grading

Running the `judgment` option will grade the checklist for the specified test models. The function will output the grading results for the test set. The format of each item is as follows:

```json
{
    "session_id": "<Identifier of the query in RocketEval>",
    "model_test": "<Model name>",
    "judge": "<Judge model name>",
    "norm_probability": [0.1, 0.3, 0.5, 0.7, 0.9],
    "judgment": ["No (10%)", "No (30%)", "Unsure (50%)", "Yes (70%)", "Yes (90%)"]
}
```
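
To get a feel for these records, the sketch below collapses one grading result into a single number by averaging `norm_probability`. This is a naive baseline for inspection only, not the score predictor RocketEval learns. It assumes each entry of `norm_probability` corresponds to one checklist item, which the aligned example lists suggest but this README does not state; `naive_score` is a hypothetical helper.

```python
import json
from statistics import mean

def naive_score(judgment_record):
    """Average the per-item probabilities of one grading record.

    Returns None for an empty list. This is a simple baseline, not
    RocketEval's learned predictor.
    """
    probabilities = judgment_record["norm_probability"]
    return mean(probabilities) if probabilities else None

row = json.loads(
    '{"session_id": "q-001", "model_test": "model-a", "judge": "judge-model", '
    '"norm_probability": [0.1, 0.3, 0.5, 0.7, 0.9], '
    '"judgment": ["No (10%)", "No (30%)", "Unsure (50%)", "Yes (70%)", "Yes (90%)"]}'
)
score = naive_score(row)  # approximately 0.5 for this record
```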

### Predicting Scores

RocketEval predicts the final scores by fitting a predictor on training-set scores assigned by a powerful judge model (e.g., GPT-4). You can obtain these scores from external tools (such as [WildBench](https://github.com/allenai/WildBench) or [FastChat LLM Judge](https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge)) and convert them to the WildBench-compatible format as follows:

```json
{
    "session_id": "<Identifier of the query in RocketEval>",
    "model_test": "<Model name>",
    "score": 3.0
}
```
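
The conversion itself is mostly a matter of renaming fields. A minimal sketch, assuming the external tool's results have already been reduced to `(session_id, model_name, score)` tuples (that intermediate form and the helper `to_score_records` are hypothetical):

```python
import json

def to_score_records(raw_scores):
    """Convert (session_id, model_name, score) tuples to score records."""
    return [
        {"session_id": session_id, "model_test": model_name, "score": float(score)}
        for session_id, model_name, score in raw_scores
    ]

raw = [("q-001", "model-a", 3), ("q-001", "model-b", "7.5")]
records = to_score_records(raw)
# One JSON object per line, ready to be written to a .jsonl file.
jsonl_text = "\n".join(json.dumps(r) for r in records) + "\n"
```

Casting to `float` keeps the `score` field numeric even when the external tool reports scores as integers or strings.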

### Produce Rankings

You can produce the rankings with the `ranking` option, which outputs the rankings for the test set.
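
For intuition, a ranking can be derived from per-query scores by averaging them per model and sorting. The sketch below does exactly that on records in the score format shown earlier; RocketEval's `ranking` task may aggregate differently, and `rank_models` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean

def rank_models(score_records):
    """Return (model, mean score) pairs sorted from best to worst."""
    scores_by_model = defaultdict(list)
    for record in score_records:
        scores_by_model[record["model_test"]].append(record["score"])
    averages = {model: mean(scores) for model, scores in scores_by_model.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)

score_records = [
    {"session_id": "q-001", "model_test": "m1", "score": 8.0},
    {"session_id": "q-002", "model_test": "m1", "score": 6.0},
    {"session_id": "q-001", "model_test": "m2", "score": 9.0},
    {"session_id": "q-002", "model_test": "m2", "score": 3.0},
]
ranking = rank_models(score_records)
```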


## Output Simulated Matches for Chatbot Arena

You can output simulated matches for [LMSYS Chatbot Arena](https://lmarena.ai/) with the `chatbot_arena_match` function, which outputs the matches between all test models.

```python
from rocketeval.tools.export import chatbot_arena_match
from rocketeval.data.data_loader import load_target_models

model_names = load_target_models(config_dir="config/", dataset_name="mt-bench", split="test")
result = chatbot_arena_match("mt-bench", "Gemma-2-2B-it", model_names, "data/")
result.to_json("matches.jsonl", orient="records", lines=True)
```

The output `matches.jsonl` can be loaded by the [notebook](https://colab.research.google.com/drive/1KdwokPjirkTmpO_P1WByFNFiqxWQquwH) to calculate Elo ratings and conduct further analysis.
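
If you prefer a quick local check, an online Elo update over the matches can be sketched as follows. This README does not document the columns of `matches.jsonl`; the code assumes each row has `model_a`, `model_b`, and `winner` fields, with `winner` equal to `"model_a"`, `"model_b"`, or a tie label. Verify this against your own output before relying on it. The constants are common Elo defaults, not values taken from RocketEval.

```python
from collections import defaultdict

def elo_ratings(matches, k=4.0, scale=400.0, base=10.0, init=1000.0):
    """Compute online Elo ratings from a sequence of match dicts.

    Assumed (unverified) row format: {"model_a", "model_b", "winner"}.
    Any winner value other than "model_a" or "model_b" counts as a tie.
    """
    ratings = defaultdict(lambda: init)
    for match in matches:
        a, b = match["model_a"], match["model_b"]
        rating_a, rating_b = ratings[a], ratings[b]
        # Expected score of model A under the logistic Elo model.
        expected_a = 1.0 / (1.0 + base ** ((rating_b - rating_a) / scale))
        if match["winner"] == "model_a":
            score_a = 1.0
        elif match["winner"] == "model_b":
            score_a = 0.0
        else:
            score_a = 0.5
        delta = k * (score_a - expected_a)
        ratings[a] = rating_a + delta
        ratings[b] = rating_b - delta
    return dict(ratings)

matches = [
    {"model_a": "A", "model_b": "B", "winner": "model_a"},
    {"model_a": "A", "model_b": "B", "winner": "tie"},
]
ratings = elo_ratings(matches)
```

Online Elo depends on match order, so results can differ slightly between shuffles of the same matches.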



