# Evaluations

This directory contains end-to-end pipelines for AI-enhanced evaluation. This document introduces the evaluation pipeline and the data format it uses.

## Generate Answers

### ChatGPT (gpt-3.5-turbo)

Make sure you have set up the OpenAI API key in your environment, for example (the value below is a placeholder):
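```bash
export OPENAI_API_KEY="YOUR_API_KEY"  # placeholder; substitute your real key
```

Then run: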

```bash
python qa_baseline_gpt35.py --question table/question.jsonl --output table/answer/answer_gpt35.jsonl
```

### Bard

Unfortunately, Bard has not released a public API yet. You may have to enter the answers manually, or you could find a third-party project that interfaces with Bard.

### Vicuna and others

To generate answers with Vicuna or other models, specify the path to the model checkpoint and a desired model ID, then run:
```bash
python get_model_answer.py --model-id [MODEL-ID] --model-path /model/path --question-file table/question.jsonl --answer-file table/answer/answer.jsonl --num-gpus [NUM-GPUS]
```
Then the answers to the questions will be saved in `table/answer/answer.jsonl`.
Note: we assume the model can be loaded with a single GPU.
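For example, a hypothetical invocation for a local Vicuna-13B checkpoint (the model path and answer file name here are illustrative):

```bash
python get_model_answer.py --model-id vicuna-13b:v1 --model-path /path/to/vicuna-13b --question-file table/question.jsonl --answer-file table/answer/answer_vicuna-13b.jsonl --num-gpus 1
```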

## Evaluate Answers Automatically

### Generate Reviews with GPT-4

Note: the script below requires access to the GPT-4 API. If you only have access to GPT-4 through the web interface, you can evaluate the answers by formatting the prompts manually. See the **Reviewers** and **Prompts** sections under **Data Format** for more details.
It is critical to follow the prompt templates; otherwise GPT-4 may not give fair reviews. `table/review/*.jsonl` contains some example reviews generated by GPT-4, which you can also view on our eval [webpage](https://vicuna.lmsys.org/eval/).

To use the script for generating reviews with GPT-4, `export` your OpenAI API key as an environment variable. Then run:
```bash
python eval_gpt_review.py -q table/question.jsonl -a /path/to/answer_1.jsonl /path/to/answer_2.jsonl -p table/prompt.jsonl -r table/reviewer.jsonl -o /path/to/review_output.jsonl
```
The GPT-4 reviews will be saved in `/path/to/review_output.jsonl`. Note: we implement some simple parsing code to extract the score pairs from GPT-4's reviews. However, you need to double-check whether the parsed score pairs are correct; the parsing logic may fail if GPT-4 does not give a structured answer.
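If you want to sanity-check the parsed scores, a minimal sketch like the following (assuming the output path from the command above) flags records whose `score` field looks malformed:

```python
import json

# Flag reviews whose parsed score pair is missing or malformed.
with open("/path/to/review_output.jsonl") as f:
    for line in f:
        review = json.loads(line)
        score = review.get("score")
        if not (isinstance(score, list) and len(score) == 2):
            print(f"Check review {review['review_id']} manually: score={score}")
```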

## Visualize Results

You can generate the data for the webpage by running:

```bash
python eval/generate_webpage_data_from_table.py
```

Then you can serve the static website in the `webpage` directory to see the results.
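For example, Python's built-in HTTP server can serve the directory locally (the `--directory` flag requires Python 3.7+):

```bash
python3 -m http.server 8000 --directory webpage
```

Then open `http://localhost:8000` in your browser.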

## Data Format

If you want a deeper understanding of our evaluation pipeline or want to contribute to the evaluation process, you need to learn the data format we use for evaluation.

Our evaluation data are encoded with [JSON Lines](https://jsonlines.org/).
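This means each file is plain text with one JSON object per line, so records can be read with the standard library alone. For example:

```python
import json

# Read a JSON Lines file: one JSON object per line.
with open("table/question.jsonl") as f:
    questions = [json.loads(line) for line in f]

print(questions[0]["text"])
```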

### Random ID Generation

We use the `shortuuid` Python library for generating short random UUIDs.

```python
import shortuuid

# shortuuid.uuid() returns a short random UUID as a str.
new_id = shortuuid.uuid()
```

### Models

`model.jsonl` contains information about the models we used for generating answers.

Each row contains a record of a model with the following fields:

* `model_id` (str): A unique ID for a model. Models with different IDs are supposed to have different performance. The ID is formatted as `{model_name}:{model_version}`.
* `model_name` (str): The name of a model. This is not unique, because a model could be trained and updated continuously; it is still considered the same model, just with different versions.
* `model_version` (str): The version of a model.
* `model_metadata` (Any): Any metadata of a model (descriptions etc). This is optional.

For example:

```json
{
  "model_id": "vicuna-13b:v1",
  "model_name": "vicuna-13b",
  "model_version": "v1",
  "model_metadata": "learning rate 1e-5, 3 epochs, 13b"
}
```

### Prompts

We store prompts in `prompt.jsonl`. Each row contains a record of a prompt with the following field:

* `prompt_id` (int): A unique integer ID for a prompt. Prompts with different IDs are supposed to have different purposes.
* `system_prompt` (str): The system prompt given to a model. This is the prompt that the model sees first.
* `prompt_template` (str): The prompt body. This is the user prompt that the model sees after the system prompt. It is a Python format-string template, so we can fill in the inputs later.
* `defaults` (dict): A dictionary of default values for the prompt template. It can be empty.
* `description` (str): A description of the functionality of the prompt.

For example:

```json
{
  "prompt_id": 1,
  "system_prompt": "You are a helpful assistant.",
  "prompt_template": "[Question]\n{question}\n\n[Assistant 1]\n{answer_1}\n\n[End of Assistant 1]\n\n[Assistant 2]\n{answer_2}\n\n[End of Assistant 2]\n\n[System]\n{prompt}\n\n",
  "defaults": {"prompt": "Which assistant is more helpful?"},
  "description": "Compare two assistants' answers to a question."
}
```
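To illustrate how such a record might be expanded into an actual prompt, here is a minimal sketch (the exact argument handling in our scripts may differ; the question and answers are made up):

```python
import json

# Load the first prompt record from prompt.jsonl.
with open("table/prompt.jsonl") as f:
    prompt = json.loads(f.readline())

# Fill the template; values in `defaults` cover placeholders
# that are not passed explicitly (here, {prompt}).
body = prompt["prompt_template"].format(
    question="How can I improve my time management skills?",
    answer_1="First answer text...",
    answer_2="Second answer text...",
    **prompt["defaults"],
)
print(prompt["system_prompt"] + "\n\n" + body)
```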

### Reviewers

`reviewer.jsonl` contains information about the reviewers we used for reviewing answers generated by different models. Each row contains a record of a reviewer with the following fields:

* `reviewer_id` (str): A unique ID for a reviewer. Reviewers with different IDs are supposed to have different reviewing performance.
* `prompt_id` (int): The ID of the prompt given to the reviewer (e.g., an AI assistant). Different prompts could result in different reviewing performance.
* `metadata` (dict): Metadata about a reviewer's configuration (e.g., decoding parameters).
* `description` (str): A description of the reviewer.
* `category` (str): The category that the reviewer belongs to.

For example:

```json
{
  "reviewer_id": "gpt-4-0328-default",
  "prompt_id": 1,
  "temperature": 0.2,
  "max_tokens": 8192,
  "description": "GPT-4 for general questions.",
  "category": "general"
}
```

### Questions

`question.jsonl` contains the questions we used for evaluation. Each row contains a record of a question with the following fields:

* `question_id` (int): A unique integer for a question. Questions with different IDs are supposed to be different.
* `text` (str): The question text.
* `category` (str): The category of the question. Questions with the same category are supposed to be similar or originate from the same source.
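For example (a hypothetical record):

```json
{
  "question_id": 1,
  "text": "How can I improve my time management skills?",
  "category": "generic"
}
```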

### Answers

`answer/xxx.jsonl` contains answers generated by different models. Each row contains a record of an answer with the following fields:

* `answer_id` (str): A unique UUID for an answer. Answers with different IDs are supposed to be different.
* `question_id` (int): The ID of the question the answer is generated for.
* `model_id` (str): The ID of the model the answer is generated by.
* `text` (str): The answer text.
* `metadata` (dict): Any metadata of the answer.

Example:

```json
{
  "answer_id": "[short uuid]",
  "question_id": 1,
  "model_id": "vicuna-13b:v1",
  "text": "Here are five tips...",
  "metadata": {}
}
```
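As a minimal sketch, an answer record could be appended to such a file like this (the file name is illustrative):

```python
import json

import shortuuid

# Build an answer record linking a question to a model's response.
answer = {
    "answer_id": shortuuid.uuid(),
    "question_id": 1,
    "model_id": "vicuna-13b:v1",
    "text": "Here are five tips...",
    "metadata": {},
}

# Append it as one JSON object per line.
with open("table/answer/answer_vicuna-13b.jsonl", "a") as f:
    f.write(json.dumps(answer) + "\n")
```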

### Reviews

`review/xxx.jsonl` contains reviews given by reviewers, comparing performance between a pair of models. Each row contains a record of a review with the following fields:

* `review_id` (str): A unique UUID for a review. Reviews with different IDs are supposed to be different.
* `question_id` (int): The ID of the question the review is given for.
* `answer1_id` (str): The ID of the first answer.
* `answer2_id` (str): The ID of the second answer.
* `text` (str): The review text.
* `score` (list): A list of scores given by the reviewer. The first score is for the first answer, and the second score is for the second answer.
* `reviewer_id` (str): The ID of the reviewer.
* `metadata` (dict): Any metadata of the review.

```json
{
  "review_id": "[short uuid]",
  "question_id": 1,
  "answer1_id": "[answer1_id]",
  "answer2_id": "[answer2_id]",
  "text": "Assistant 2 is better...",
  "score": [9.0, 7.5],
  "reviewer_id": "gpt-4-0328-default",
  "metadata": {}
}
```
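To compare two answer sets across all reviews, the score pairs can be summed with a short sketch like this (the file path is a placeholder; the webpage script computes its own statistics):

```python
import json

# Sum the score pairs across reviews for the two answer sets.
total_1, total_2 = 0.0, 0.0
with open("/path/to/review_output.jsonl") as f:
    for line in f:
        s1, s2 = json.loads(line)["score"]
        total_1 += s1
        total_2 += s2

print(f"Assistant 1 total: {total_1}, Assistant 2 total: {total_2}")
```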
