This folder contains two repositories: `InfDist`, which contains the code for performing Influence Distillation, and `mmft`, a fork of [this repo](https://github.com/hamishivi/automated-instruction-selection) that contains the code for running the experiments.

## Installation
To run the experiments, create a new environment with Python 3.12.9 and install the dependencies with the following commands:

```bash
mamba create --name infdist python=3.12.9 # you can use conda or venv as well
mamba activate infdist

pip install torch==2.5.1
pip install -r requirements.txt
pip install git+https://github.com/Dao-AILab/fast-hadamard-transform.git

# install InfDist
cd InfDist && pip install -e . && cd ..
```

## Running the Experiments
Due to size limits, we only include a pool of 50k samples from Tulu v2, along with another 10k samples that we use for warm-up (see the `./data/training_data/` folder). To run the experiments with larger pool sizes, refer to the README of [this repo](https://github.com/hamishivi/automated-instruction-selection), download the full data, and create larger pools. The `POOL_SIZE` argument in the scripts we provide controls which pool is used.

By default, these scripts train Llama2-7B; you can change this by modifying the `MODEL_NAME` argument in the scripts. Note that the default setup requires one GPU with 80GB of memory (such as an H100).

Before running the scripts below, `cd` into the `mmft` folder and make sure the `infdist` environment is activated.

### Warm-up
We first need to run a warm-up training on the 10k warm-up samples. To do so, run:

```bash
CUDA_VISIBLE_DEVICES=0 bash run_warmup_training.sh
```
This script fine-tunes the model on the warm-up data and stores the result in the `./checkpoints` directory. Within that directory, locate the last checkpoint folder, named `checkpoint-*`; its path is needed in the following steps.

This script also evaluates the model on all 6 tasks mentioned in the paper. You can find the results in the model path, e.g., `eval_gsm`.
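
If you prefer not to look for the last `checkpoint-*` folder by hand, a small shell helper can do it. The following is a minimal sketch, not part of the provided scripts: the function name `latest_checkpoint` is hypothetical, and it assumes the numeric suffix of `checkpoint-*` is the training step, so the largest number is the last checkpoint. If `./checkpoints` contains several runs, it returns the highest step across all of them, so point it at a single run directory in that case.

```bash
# Hypothetical helper (not part of the repo): print the checkpoint-* directory
# with the largest numeric suffix found anywhere under the given root.
latest_checkpoint() {
  local root="${1:-./checkpoints}"
  find "$root" -type d -name 'checkpoint-*' \
    | awk -F'checkpoint-' '{ print $NF "\t" $0 }' \
    | sort -n \
    | tail -n 1 \
    | cut -f2-
}

# Example usage (assumes the warm-up run has finished):
#   MODEL_PATH="$(latest_checkpoint ./checkpoints)"
#   echo "$MODEL_PATH"
```

The numeric sort matters here: a plain lexicographic sort would rank `checkpoint-900` after `checkpoint-1000`.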

### JVP Embeddings
Now we use the warmed-up model to embed our 50k samples in the pool. Run:

```bash
CUDA_VISIBLE_DEVICES=0 bash run_jvp_embedding.sh MODEL_PATH=/path/to/the/last/warmup/checkpoint/
```

This will calculate and store the embeddings in the `./embeddings` folder. Note: this implementation could be made faster by batching the samples and tangents.

### Selection and Training
With the warmed-up model and the JVP embeddings in hand, we select 10k samples from the pool and train the model on them:

```bash
CUDA_VISIBLE_DEVICES=0 bash run_infdist_training_multiphase.sh TASK=mmlu_shots NUM_LANDMARKS=2048 RESTART_AFTER_WARMUP=true DISTILL_MODEL=/path/to/the/last/warmup/checkpoint/ INDEX_PATH=/path/to/jvp/embeddings
```

This will select data for `mmlu`, then fine-tune and evaluate the resulting model. You can try other datasets by changing `TASK`: `gsm8k_shots`, `bbh_shots`, `tydiqa_shots`, `codex`, and `squad`.
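
To run selection and training for every task in turn, the command above can be wrapped in a loop. The sketch below is only an illustration: the function name `print_task_commands` is hypothetical, the two paths are the same placeholders used above, and reusing `NUM_LANDMARKS=2048` for every task is an assumption rather than a recommendation. It prints the commands instead of executing them, so you can review them first.

```bash
# Placeholders: replace with your warm-up checkpoint and embeddings paths.
WARMUP_CKPT="/path/to/the/last/warmup/checkpoint/"
INDEX_PATH="/path/to/jvp/embeddings"

# Hypothetical helper (not part of the repo): print one selection-and-training
# command per task, reusing the arguments from the mmlu example.
print_task_commands() {
  local ckpt="$1" index="$2" task
  for task in mmlu_shots gsm8k_shots bbh_shots tydiqa_shots codex squad; do
    echo "CUDA_VISIBLE_DEVICES=0 bash run_infdist_training_multiphase.sh" \
      "TASK=${task} NUM_LANDMARKS=2048 RESTART_AFTER_WARMUP=true" \
      "DISTILL_MODEL=${ckpt} INDEX_PATH=${index}"
  done
}

print_task_commands "$WARMUP_CKPT" "$INDEX_PATH"
```

Once the printed commands look right, piping the output to `bash` from inside the `mmft` folder runs them sequentially on GPU 0.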


For running other baselines, please refer to the README in [this repo](https://github.com/hamishivi/automated-instruction-selection).