# AutoEval

## Installation

### System Installation

This code has been tested on Ubuntu 22.04. To install it, first create and activate a Python virtual environment as follows.

```bash
apt update
apt install -y python3-venv
python3 -m venv venv
source venv/bin/activate
```

Next, install the required software by running `./scripts/install.sh`.

### Environment Setup

Set up the required environment variables. For API keys, export only the variables that you expect to use.

```bash
export PYTHONHASHSEED=0
export OPENAI_API_KEY=<your_openai_api_key>
export ANTHROPIC_API_KEY=<your_anthropic_api_key>
export HUGGINGFACE_API_KEY=<your_huggingface_api_key>
```

Also log in to Hugging Face via `huggingface-cli login` and store the token. Make sure your account has access to gated models such as Llama-3.
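Before launching a run, it can help to confirm which of these variables are actually visible to Python. The following is a minimal sketch; the helper name `check_environment` is hypothetical and not part of AutoEval, and the variable names are copied from the export block above.

```python
import os

# Names copied from the export block above; export only the keys you need.
API_KEY_VARS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "HUGGINGFACE_API_KEY")


def check_environment(env=None):
    """Report which variables are set (hypothetical helper, not part of AutoEval)."""
    env = os.environ if env is None else env
    status = {name: bool(env.get(name)) for name in API_KEY_VARS}
    # PYTHONHASHSEED is expected to be exactly "0" per the setup above.
    status["PYTHONHASHSEED"] = env.get("PYTHONHASHSEED") == "0"
    return status


for name, ok in check_environment().items():
    print(f"{name}: {'ok' if ok else 'not set'}")
```

A key reported as `not set` is only a problem if you intend to use the corresponding provider.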

## Datasets

Possible dataset types are:

```
fol: First-order logic (Synthetic)
fol_human: First-order logic (English)
ksat: 3-SAT
plogic: Propositional Logic
regex: Regular Expressions
```

### Packaged datasets

The zero-shot dataset can be found under `./dataset`.

The file hierarchy is `batch_no/dataset_type` (for example, `./dataset/batch0/fol`).
Each dataset type has a corresponding `dataset.json`, which contains the vanilla
dataset, and several `<model>.verified.json` files, which contain results from the
AutoEval pipeline and may be used as datasets for recomputing the metrics. Due to space limitations on the .zip, we only include the responses from our experiments for the first two batches. See the `./dataset/batch{0,1}` directories for these files.
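As an illustration of this layout, the sketch below lists the verified result files for one dataset type. The function name `find_verified_files` is hypothetical, and the glob pattern assumes batch directories are named `batch0`, `batch1`, and so on, as in the paths mentioned above.

```python
from pathlib import Path


def find_verified_files(base_dir, dataset_type):
    """Return sorted paths of <model>.verified.json files for one dataset type.

    Assumes the layout <base_dir>/batch<N>/<dataset_type>/ described above.
    """
    pattern = f"batch*/{dataset_type}/*.verified.json"
    return sorted(Path(base_dir).glob(pattern))


# Prints nothing if ./dataset is missing or holds no verified files.
for path in find_verified_files("./dataset", "fol"):
    print(path)
```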

Each dataset is a JSON file with an `info` key that contains the seed and other
parameters used to generate it. The `data` key contains a list of data items.
Each item is a dictionary with `fs` holding the ground-truth formal syntax,
plus keys such as `depth`, which gives the CFG tree depth of the expression.

The verified datasets contain additional fields: `llm_fs` holds the
LLM-generated formal syntax, `conversation_history` holds the complete LLM response,
and `verification` holds the verification results.
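The structure described above can be read with the standard `json` module. This is a minimal sketch: the helper names are hypothetical, only the keys named above (`info`, `data`, `fs`, `depth`, `llm_fs`) are used, and the contents of `verification` are deliberately left untouched because their exact shape is not described here.

```python
import json
from collections import Counter


def load_dataset(path):
    """Load a dataset file and return its (info, data) pair."""
    with open(path, encoding="utf-8") as f:
        dataset = json.load(f)
    return dataset["info"], dataset["data"]


def depth_histogram(items):
    """Count items per CFG tree depth; items lacking `depth` fall under None."""
    return Counter(item.get("depth") for item in items)


def llm_outputs(items):
    """Pair ground-truth `fs` with `llm_fs` for items of a verified dataset."""
    return [(item["fs"], item["llm_fs"]) for item in items if "llm_fs" in item]
```

For example, `info, data = load_dataset("./dataset/batch0/fol/dataset.json")` followed by `depth_histogram(data)` summarizes how many expressions exist at each depth.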

### Generating your own datasets

Use the following script to generate datasets of different types:

```bash
python dataset_generator.py \
    --base-dir <base_dir> \
    --dataset-type <dataset_type> \
    --depth <depth> \
    --n <branching_factor_to_sample> \
    --seed <random_seed>
```

Run the script with `-h` for more information about dataset-specific
parameters (e.g., `--num-propositions` for propositional logic).

### Adding new dataset types

Please provide new grammars in either `nlfs.dataset.propositional_dataset` or
`nlfs.dataset.regex_dataset` to create any new dataset based on these types.
The pipeline should work out-of-the-box in that case.

For new types of formal syntax, the appropriate verifier must also be included
in `nlfs.verifier`.

## Reproducing ICLR-25 results

To reproduce the results reported in the main paper, use the following command.
You can skip the LLM step or the verifier with the `--skip-nlfs` and `--skip-verify`
options, respectively. The results and log files are stored in the dataset's own
directory. For example, batch 0 of `fol` saves its results in
`./dataset/batch0/fol`.

```bash
python3 evaluation.py --auto \
  --total-batches 10 \
  --base-dir ./dataset/ \
  --datasets fol ksat ... \
  --models gpt-3.5-turbo claude ...
```

Possible models are:

```
gpt-3.5-turbo: ChatGPT
gpt-4o: GPT-4o
claude: Anthropic Claude Sonnet
phi-3: Microsoft Phi
llama-3-8b: Meta Llama-3
mistral: Mistral 7B v0.2
```

Note that you need to set up the API keys via environment variables as described
above.
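If you prefer to launch the evaluation from Python, the command can be assembled as an argument list. This is a sketch only: `build_evaluation_command` is a hypothetical helper, it uses just the options shown in this section, and it assumes `--skip-nlfs` and `--skip-verify` are plain on/off switches.

```python
import shlex


def build_evaluation_command(datasets, models, total_batches=10,
                             base_dir="./dataset/", skip_nlfs=False,
                             skip_verify=False):
    """Assemble the evaluation.py invocation as an argv list (hypothetical helper)."""
    cmd = ["python3", "evaluation.py", "--auto",
           "--total-batches", str(total_batches),
           "--base-dir", base_dir,
           "--datasets", *datasets,
           "--models", *models]
    # Assumed to be boolean switches that take no value.
    if skip_nlfs:
        cmd.append("--skip-nlfs")
    if skip_verify:
        cmd.append("--skip-verify")
    return cmd


cmd = build_evaluation_command(["fol", "ksat"], ["gpt-3.5-turbo", "claude"])
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root.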

To run open-source models, please set up a vLLM server following the instructions
[here](https://docs.vllm.ai/en/stable/serving/openai_compatible_server.html).

## Scripts for reproducing with new seed datasets

Use `./scripts/iclr25/dataset_gen.sh` to regenerate the datasets. Note that this deletes the existing dataset files and generates a new dataset under `./dataset`.

Similarly, `./scripts/iclr25/o1_dataset_gen.sh` recreates the o1 datasets under `./dataset`. Both scripts delete any existing datasets before generating new ones.

To regenerate the same datasets that we used, look up the `seed` parameter in the metadata of the dataset `.json` files and pass it via `--seed` to the commands inside these script files.
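To look up the seeds without opening each file by hand, something like the following could be used. It is a sketch under two assumptions: the seed is stored as `info["seed"]` inside each `dataset.json`, and the files sit at `batch<N>/<dataset_type>/dataset.json`. The name `collect_seeds` is hypothetical.

```python
import json
from pathlib import Path


def collect_seeds(base_dir):
    """Map each dataset.json (path relative to base_dir) to the seed in its `info`.

    Assumes the seed is stored as info["seed"]; files without it map to None.
    """
    base = Path(base_dir)
    seeds = {}
    for path in sorted(base.glob("batch*/*/dataset.json")):
        with open(path, encoding="utf-8") as f:
            info = json.load(f).get("info", {})
        seeds[path.relative_to(base).as_posix()] = info.get("seed")
    return seeds


for name, seed in collect_seeds("./dataset").items():
    print(f"{name}: seed={seed}")
```

Run this before invoking the generation scripts, since they delete the existing dataset files.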

## Scripts for running the models and reproducing the results

Execute `./scripts/iclr25/reproduce_run.sh <model_name> <dataset_name>` to run the AutoEval pipeline on the datasets (either new ones you generated or the ones provided).

The model and dataset names to use are the ones listed above.

This script will run the AutoEval pipeline on the dataset in `./dataset` for 10 batches.
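To cover several model and dataset combinations, the script can be called in a loop. The sketch below is one way to do that; `sweep_commands` and `run_sweep` are hypothetical helpers, and the default is a dry run that only prints the commands.

```python
import itertools
import subprocess

SCRIPT = "./scripts/iclr25/reproduce_run.sh"


def sweep_commands(models, datasets):
    """Build one reproduce_run.sh invocation per (model, dataset) pair."""
    return [[SCRIPT, m, d] for m, d in itertools.product(models, datasets)]


def run_sweep(models, datasets, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in sweep_commands(models, datasets):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


run_sweep(["gpt-4o", "claude"], ["fol", "regex"])  # dry run: prints only
```

Pass `dry_run=False` once the printed commands look right; `check=True` stops the sweep at the first failing run.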

## Running FOLIO and LogiEval

You can run LogiEval by executing `./scripts/iclr25/logic_eval.sh`. Edit the script to use the correct `model_name`. You still need to set up the environment variables.
