## micro-benchmarking-reliability

This repository contains the code and data needed to reproduce the results
of the ICLR submission "How Reliable is Language Model Micro-Benchmarking?"

**Install requirements:**

```
pip install -r requirements.txt
```

## Steps to reproduce the results in the paper

The `graphs-combine-subtasks` directory contains the final processed results used in the paper.
Steps 1 and 2 are provided to show the full experimental setup.
Skip to step 3 to plot the existing results.

### 1. Run micro-benchmarking methods:

When the code is publicly released, we will also include cached model evaluation results from the
[Open LLM Leaderboard v2](https://huggingface.co/spaces/open-llm-leaderboard/blog) that will allow for
re-running all micro-benchmarking methods.

Here is an example command to run the micro-benchmarking evaluations for the
MMLU-Pro dataset:

```
python evaluate-microbenchmarks.py \
    --selection_techniques Random Random_Subtask_Stratified_Equal Anchor_Points_Weighted Stratified_Random_Sampling tinyBenchmarks DPP \
    --num_source_models 300 \
    --num_runs 50 \
    --benchmark mmlu-pro \
    --combine_subtasks \
    --same_points \
    --num_threads 10
```

To reproduce the main results in the paper, rerun this command with `--benchmark`
set to each of the other benchmarks: `mmlu`, `bbh`, `gpqa`.
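
As a convenience, the four runs can be wrapped in a small loop. The sketch below is not part of the repository: it assumes that the flags from the MMLU-Pro example apply unchanged to the other benchmarks and that it is run from the directory containing `evaluate-microbenchmarks.py`. The `DRY_RUN` variable and the `run_benchmark` and `run_all` helpers are hypothetical names introduced here. By default the script only prints the commands; set `DRY_RUN=0` to execute them.

```shell
#!/bin/sh
set -eu

# Hypothetical switch (not part of the repository):
# 1 = only print each command, 0 = actually run it.
DRY_RUN="${DRY_RUN:-1}"

run_benchmark() {
    # $1 is the benchmark name. The remaining flags are copied from the
    # MMLU-Pro example; reusing them for other benchmarks is an assumption.
    set -- python evaluate-microbenchmarks.py \
        --selection_techniques Random Random_Subtask_Stratified_Equal Anchor_Points_Weighted Stratified_Random_Sampling tinyBenchmarks DPP \
        --num_source_models 300 \
        --num_runs 50 \
        --benchmark "$1" \
        --combine_subtasks \
        --same_points \
        --num_threads 10
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

run_all() {
    for benchmark in mmlu-pro mmlu bbh gpqa; do
        run_benchmark "$benchmark"
    done
}

run_all
```

Saved under any name (for example `run-all.sh`, a hypothetical file name), `sh run-all.sh` prints the four commands for inspection and `DRY_RUN=0 sh run-all.sh` runs them one after another.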

### 2. Process micro-benchmarking results:

Before plotting, process all results by running the following command:

```
python process-results-combine-subtasks.py
```

### 3. Make all plots:

Each file that begins with `figure` can be used to reproduce a figure from the paper.
For example, `figure-1.py` will reproduce Figure 1.
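
To regenerate every figure in one go, a loop over the `figure*` scripts is one option. This is a sketch under assumptions the README does not confirm: that the figure scripts are Python files matching `figure*.py`, that they take no command-line arguments, and that they are meant to be run from the directory they live in. `DRY_RUN` and `make_figures` are hypothetical names introduced here. By default the script only lists the commands; set `DRY_RUN=0` to run them.

```shell
#!/bin/sh
set -eu

# Hypothetical switch (not part of the repository):
# 1 = only print each command, 0 = actually run it.
DRY_RUN="${DRY_RUN:-1}"

make_figures() {
    # $1 is the directory holding the figure scripts (defaults to ".").
    dir="${1:-.}"
    for script in "$dir"/figure*.py; do
        # If nothing matches, the pattern stays unexpanded; skip it.
        [ -e "$script" ] || continue
        if [ "$DRY_RUN" = "1" ]; then
            echo "python $script"
        else
            python "$script"
        fi
    done
}

make_figures .
```

Because `set -eu` is active, the loop stops at the first script that fails, which makes it easier to see which figure could not be reproduced.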

## Licenses

We use and adapt code from [Anchor Points](https://github.com/rvivek3/AnchorPoints),
[tinyBenchmarks](https://github.com/felipemaiapolo/tinyBenchmarks),
[py-irt](https://github.com/nd-ball/py-irt),
and [DPPcoresets](https://github.com/hsimonfroy/DPPcoresets).
Their licenses are available in the `licenses` directory.