# Artifact of EvoEval

## Setup 

### Requirements

- Python 3.9+
- GPU access to re-generate LLM samples

### Initial Setup

```shell
conda create -n evoeval python=3.9
conda activate evoeval
pip install -r requirements.txt
export PYTHONPATH=$PYTHONPATH:$(pwd)
```

This creates the necessary conda environment and installs the required packages.

### EvoEval Benchmark 

Due to the size of each EvoEval benchmark, we did not include the raw benchmarks in this repository; however, they can still be
obtained via the script. Note that the script will `wget` from a GitHub repository, and following the link may reveal the identity of the authors.

## Generate LLM samples

Generating LLM samples can be costly, so we have included all the pre-generated LLM samples.

To extract them:

```shell
sudo apt install p7zip-full p7zip-rar
7z x llm_samples.7z
```
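If `7z` is missing or the archive is not in the current directory, the extraction fails with a fairly terse error. A minimal sketch of a guard is shown below; `extract_samples` is a hypothetical helper name and is not part of this repository.

```shell
# Hypothetical helper (not part of the repository): extract the archive
# only when both the 7z binary and the archive file are present.
extract_samples() {
  archive="${1:-llm_samples.7z}"
  if ! command -v 7z >/dev/null 2>&1; then
    echo "7z not found; install p7zip-full first" >&2
    return 1
  fi
  if [ ! -f "$archive" ]; then
    echo "archive not found: $archive" >&2
    return 1
  fi
  7z x "$archive"
}
```

Calling `extract_samples llm_samples.7z` from the repository root then behaves like the plain `7z x` command above, but reports which prerequisite is missing.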

You can also re-generate the samples using the following command:
 
```shell
python codegen/generate.py --dataset {dataset_name} --model {model_name} --root ./pregen --bs 1 --temperature 0.0 --n_samples 1 --greedy
```

## Run EvoEval Evaluation

To run EvoEval evaluation, you can use the following command:

```shell
# if the generation is in jsonl format
python evoeval/evaluate.py --dataset EvoEval_difficult --samples EvoEval_difficult_samples.jsonl
```

You can also run the evaluation when the model output is a folder, for example after extracting the pre-generated LLM samples:
    
```shell
python evoeval/evaluate.py  --dataset EvoEval_difficult --samples gpt-4_temp_0.0/EvoEval_difficult
```

You should see the following output:
```shell
Computing expected output...
Expected outputs computed in 11.24s
Reading samples...
100it [00:00, 164.16it/s]
100%|████████████████████████████████████████████████████████████████| 100/100 [00:07<00:00, 12.77it/s]
EvoEval_difficult
pass@1: 0.520 # for reference GPT-4 solves more than 80% of problems in HumanEval
```

You can repeat this process for any of the 7 EvoEval benchmarks, as well as HumanEval, by changing the `--dataset` flag.
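
To evaluate several datasets in one go, the command can be wrapped in a loop. The sketch below is an assumption-laden illustration: `eval_cmd` is a hypothetical helper, and only `EvoEval_difficult` is listed because it is the only dataset name that appears in this README; add the other names you use. It assumes the samples follow the `{model_dir}/{dataset}` folder layout shown above.

```shell
# Hypothetical helper: build the evaluation command for one dataset,
# assuming samples live under <samples_root>/<dataset>.
eval_cmd() {
  printf 'python evoeval/evaluate.py --dataset %s --samples %s/%s\n' "$1" "$2" "$1"
}

# Only EvoEval_difficult is named in this README; append other dataset
# names (space-separated) as needed.
DATASETS="EvoEval_difficult"
SAMPLES_ROOT="gpt-4_temp_0.0"

for dataset in $DATASETS; do
  # Prints the command rather than running it (a dry run).
  eval_cmd "$dataset" "$SAMPLES_ROOT"
done
```

Once the printed commands look right, pipe the loop's output to `sh` to execute them.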
