## Get Started for Evaluation

This guide walks through setting up the evaluation environment for the MagiC Benchmark: installing the required packages and running the evaluation scripts.

### Pre-requisites - Install required packages

```bash
# Only create a new environment if you don't have one already
conda create -n tbys python=3.10
conda activate tbys
# Install the required packages
python3 -m pip install -r requirements.txt
# vLLM only works with CUDA backends; do not install it if you are using a CPU
python3 -m pip install vllm==0.8.2 bitsandbytes==0.45.4
# Install flash-attention
python3 -m pip install flash-attn --no-build-isolation
```
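
After installation, a quick import check can confirm that the GPU packages are visible to the active environment. The snippet below is a minimal sketch, not part of the repository: `check_module` is a hypothetical helper name, and the module names (`vllm`, `bitsandbytes`, `flash_attn`) are the import names of the packages installed above.

```bash
# Hypothetical helper (not part of this repository): report whether a
# Python module can be imported in the current environment.
check_module() {
  if python3 -c "import $1" >/dev/null 2>&1; then
    echo "ok: $1"
  else
    echo "missing: $1"
  fi
}

# Check the GPU-related packages installed in the step above.
for module in vllm bitsandbytes flash_attn; do
  check_module "$module"
done
```

A `missing` line only means the import failed in this environment; on a CPU-only machine that is expected for the GPU packages.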

### Run LLM as Judge

The following command runs an LLM as a judge for the evaluation. It judges whether the ground truth and the model output mean the same thing (yes or no). The prompt template is located [here](./static). The output is saved as a `csv` file in the same directory as the input file.

_If GPU VRAM is not sufficient to load the entire model with the full context length, set a smaller value for `--max-model-len`. A value of `34000` works on GPUs with 80 GB of VRAM._

```bash
# For final answer grading
python3 final_answer_grader.py --model-id Qwen/Qwen2.5-72B-Instruct-AWQ --data-input PATH_TO_JSONL --do-sample --max-model-len 34000 --task final_answer --quantization-method awq --answer-type short
```

```bash
# For self-correction grading
python3 final_answer_grader.py --model-id Qwen/Qwen2.5-72B-Instruct-AWQ --data-input PATH_TO_JSONL --do-sample --max-model-len 34000 --task self_correction --quantization-method awq --answer-type full
```
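
If several JSONL files need grading, the commands above can be wrapped in a loop. The following is a sketch under stated assumptions: `grade_dir`, the `DRY_RUN` and `MAX_MODEL_LEN` variables, and the idea of keeping all inputs in one directory are illustrative and not part of the repository. Only the flags shown in the commands above are used.

```bash
# Hypothetical helper (not part of this repository): run the grader on
# every .jsonl file in a directory for one task.
# Usage: grade_dir DIR TASK ANSWER_TYPE
grade_dir() {
  local dir="$1" task="$2" answer_type="$3"
  local f
  for f in "$dir"/*.jsonl; do
    [ -e "$f" ] || continue  # skip when the glob matches nothing
    # With DRY_RUN set, the command is printed instead of executed.
    ${DRY_RUN:+echo} python3 final_answer_grader.py \
      --model-id Qwen/Qwen2.5-72B-Instruct-AWQ \
      --data-input "$f" \
      --do-sample \
      --max-model-len "${MAX_MODEL_LEN:-34000}" \
      --task "$task" \
      --quantization-method awq \
      --answer-type "$answer_type"
  done
}
```

For example, `DRY_RUN=1 grade_dir PATH_TO_JSONL_DIR final_answer short` prints the commands without running them, and `grade_dir PATH_TO_JSONL_DIR self_correction full` runs self-correction grading on each file.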

### Get Final Answer Report
You can run `print_judge_accu.py` to print a Markdown version of the performance table.

```bash
python3 print_judge_accu.py --data-input PATH_TO_CSV_FILES
```

### Get Attention Score
You can run `attention_grader.py` to print a Markdown version of the performance table.

```bash
python3 attention_grader.py --model-output PATH_TO_JSONL_FILES --punish-hallucination --micro --consolidated-annotations PATH_TO_TEST_SET_JSONL
```