# Large Language Monkeys

This repository provides the accompanying code for Large Language Monkeys: Scaling Inference-Time Compute with Repeated Sampling.

Specifically, it includes the code needed to:
1. Generate samples from various datasets and models.
2. Evaluate the correctness of the samples.

Four datasets are supported:
- GSM8K
- MATH
- CodeContests
- MiniF2F-MATH

We use [vLLM](https://docs.vllm.ai/en/latest/index.html) for inference, so any model that vLLM supports will work with our generation scripts.

## Installation

We use two different conda environments for this project, as the lean-dojo version we use requires Python 3.9.19.

### Environment for MiniF2F-MATH

```
conda create -n llmonk-minif2f python=3.9.19
conda activate llmonk-minif2f
pip install -r requirements_minif2f.txt
```
To run evaluation on this dataset, you also need to install Lean 4. Follow the installation instructions for your system on [this website](https://leanprover-community.github.io/get_started.html).

When prompted with:
```
Current installation options:

  default toolchain: stable
  modify PATH variable: yes

1) Proceed with installation (default)
2) Customize installation
3) Cancel installation
```
Choose option 2 and change the default toolchain to `4.3.0-rc2`.

### Environment for everything except MiniF2F-MATH

```
conda create -n llmonk python=3.11.8
conda activate llmonk
pip install -r requirements.txt
```

## Repository Structure

The repo is organized as follows:

```
large-language-monkeys/
├── llmonk/
│   ├── evaluate/
│   │   ├── gsm8k.py
│   │   ├── math.py
│   │   ├── code_contests.py
│   │   └── minif2f.py
│   ├── generate/
│   │   ├── gsm8k.py
│   │   ├── math.py
│   │   ├── code_contests.py
│   │   └── minif2f.py
│   └── tests/
│       ├── math_datasets.py
│       ├── code_contests.py
│       └── minif2f.py
├── README.md
└── requirements.txt
```

- `llmonk/evaluate/`: contains the code to evaluate dataset samples
- `llmonk/generate/`: contains the code to generate samples from a model
- `llmonk/tests/`: contains code to check the correctness of our evaluation scripts

Within each folder, there is a file for each of the supported datasets (note that the scripts for MATH and GSM8K are combined under "math_datasets" for evaluation and testing).

## Generation Scripts

These scripts are used to generate samples from a model for a dataset.

### Usage

Each file has two mandatory arguments:
1. `model`: the Hugging Face model used to generate the samples (the same string you would pass to `.from_pretrained`)
2. `save_dir`: the directory in which to save the samples

For the remaining optional arguments (e.g., temperature, number of samples, batch size, vLLM arguments), please see the `GenerateScriptConfig` class in `llmonk/utils.py`.

### Output Format

The samples are saved as YAML files (one YAML file per problem). Every dataset's YAML file contains the following keys:
- `prompt`: the prompt for the problem
- `question`: the current question for the problem
- `samples`: a list of samples for the problem

For GSM8K and MATH, there is the additional key:
- `gt_answer`: the dataset's ground truth answer for the problem

For CodeContests, there is the additional key:
- `test_cases`: dictionary with the following keys:
    - `input`: list of strings corresponding to test case inputs
    - `output`: list of strings corresponding to test case outputs

For MiniF2F-MATH, there is the additional key:
- `theorem_name`: the name of the theorem to be proven
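The key layout above can be checked mechanically once a file has been parsed. The following is a minimal sketch, assuming each YAML file has already been loaded into a Python dictionary (for example with PyYAML's `yaml.safe_load`). The helper `missing_keys` and the dataset labels are hypothetical and not part of this repository; only the key names come from the description above.

```python
# Hypothetical helper: the key names come from this README, but the function
# and the dataset labels below are illustrative and not part of the repository.
COMMON_KEYS = {"prompt", "question", "samples"}

EXTRA_KEYS = {
    "gsm8k": {"gt_answer"},
    "math": {"gt_answer"},
    "code_contests": {"test_cases"},
    "minif2f": {"theorem_name"},
}


def missing_keys(record: dict, dataset: str) -> set:
    """Return the documented keys that are absent from a loaded sample record."""
    expected = COMMON_KEYS | EXTRA_KEYS[dataset]
    missing = expected - set(record)
    # CodeContests nests two further keys under `test_cases`.
    if dataset == "code_contests" and "test_cases" in record:
        nested = {"input", "output"} - set(record["test_cases"])
        missing |= {f"test_cases.{key}" for key in nested}
    return missing


# A made-up record standing in for one parsed YAML file.
example = {
    "prompt": "Solve the following problem...",
    "question": "What is 2 + 3?",
    "samples": ["2 + 3 = 5. The answer is 5.", "The answer is 6."],
    "gt_answer": "5",
}
print(missing_keys(example, "gsm8k"))  # set()
```

An empty set means the record carries every key documented for that dataset; any other result names the keys that are missing.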

## Evaluation Scripts

These scripts evaluate the correctness of the samples generated by the generation scripts.

### Usage

Each file has two mandatory arguments:
1. `samples_dir`: the directory containing the samples
2. `save_dir`: the directory in which to save the evaluation results

For the remaining optional arguments (e.g., number of workers), please see the `EvaluateScriptConfig` class in `llmonk/utils.py`.

### Output Format

The evaluation results are saved as YAML files (one YAML file per problem), in the same format as the samples generated by the generation scripts with the additional key:
- `is_correct`: a list of booleans indicating whether each sample is correct; `is_correct[i]` is True if and only if `samples[i]` is correct
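
One way to summarize the `is_correct` lists is to estimate how often at least one of `k` samples solves a problem. The sketch below uses the standard unbiased pass@k estimator, `1 - C(n - c, k) / C(n, k)`, where `n` is the number of samples and `c` the number of correct ones. Whether this matches the analysis code used for the paper is an assumption, and the function names `pass_at_k` and `coverage` are hypothetical rather than part of this repository.

```python
from math import comb


def pass_at_k(is_correct: list, k: int) -> float:
    """Unbiased estimate of the chance that at least one of k samples is correct."""
    n = len(is_correct)
    c = sum(bool(flag) for flag in is_correct)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of samples")
    # 1 - C(n - c, k) / C(n, k); math.comb returns 0 when k > n - c.
    return 1.0 - comb(n - c, k) / comb(n, k)


def coverage(results: list, k: int) -> float:
    """Average pass@k over a list of per-problem evaluation records."""
    return sum(pass_at_k(r["is_correct"], k) for r in results) / len(results)


# Made-up evaluation records, one per problem, each with four samples.
results = [
    {"is_correct": [True, False, False, False]},
    {"is_correct": [False, False, False, False]},
]
print(coverage(results, k=2))  # 0.25
```

With `k` equal to the number of samples, `pass_at_k` reduces to checking whether any entry of `is_correct` is True.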

## Testing Scripts

The `llmonk/tests/` directory contains unit tests that check the correctness of the evaluation scripts.

## Example Commands

See `commands.md` for examples of how to run generation, evaluation, and testing.
