# MiniPLM: Knowledge Distillation of Pre-Training Language Models

<img src="figures/method.png"></img>

## 1 Setup
```bash
pip3 install -r requirements.txt
git clone https://github.com/EleutherAI/lm-evaluation-harness
pip3 install -e lm-evaluation-harness
```
or
```bash
bash install.sh
```

## 2 Pre-Training Corpus $\mathcal{D}$
We use [the Pile](https://huggingface.co/datasets/monology/pile-uncopyrighted) as our pre-training corpus. Refer to `tools/get_pile.py` to get the data ready. Run the following command for tokenization:
```bash
bash scripts/tools/process_data/pile_qwen.sh /PATH/TO/MiniPLM
```
The processed data is stored in `processed_data/pretrain/pile/qwen-1025`, containing several shards (a pair of `.bin` and `.idx` files). Each shard contains about 1B tokens. We provide the processed version (100B tokens) for reproducibility.


## 3 Models
### 3.1 Teacher Model
We use [Qwen1.5-1.8B](https://huggingface.co/Qwen/Qwen1.5-1.8B) as the teacher LM. You can download this model can put it in `checkpoints/qwen/1.8B`.
### 3.2 Reference Model
The reference model is a 104M Qwen LM trained on 5B tokens randomly split from the Pile, which should be put in `checkpoints/qwen/104M_ref`.


## 4 Training
### 4.1 MiniPLM
#### Difference Sampling
First, run inference of the teacher LM and the reference LM on the Pile data:
```bash
bash scripts/miniplm/difference_sampling/1.8B.sh /PATH/TO/MiniPLM
bash scripts/miniplm/difference_sampling/104M.sh /PATH/TO/MiniPLM
```
Then, compute the difference scores $r(x,p,p_{\text{ref}})=\frac{\log p(x)}{\log p_{\text{ref}}(x)}$:
```bash
python3 scripts/miniplm/difference_sampling/compute_difference_scores.py /PATH/TO/MiniPLM
```
Finally, construct the refined pre-training corpus with the difference scores:
```bash
python3 scripts/miniplm/difference_sampling/construct_pretrain_data.py /PATH/TO/MiniPLM 0.5 # selection ratio
```
This process constructs a 50B-token corpus from a 100B-token corpus.

#### Pre-Training
```bash
bash scripts/miniplm/pretraining/qwen/200M.sh /PATH/TO/MiniPLM
bash scripts/miniplm/pretraining/qwen/500M.sh /PATH/TO/MiniPLM
bash scripts/miniplm/pretraining/qwen/1.2B.sh /PATH/TO/MiniPLM
```

#### KD Across Model Families
To distill the knowledge of Qwen models to Mamba or LLaMA3.1, first prepare the `config.json` and tokenization-related files in `checkpoints/mamba/130M` and `checkpoints/llama3.1/212M`. Then, convert the Qwen tokenization to the target tokenization:
```bash
bash scripts/tools/convert_tokenization/convert_tokenization_qwen_mamba.sh /PATH/TO/MiniPLM
bash scripts/tools/convert_tokenization/convert_tokenization_qwen_llama3_1.sh /PATH/TO/MiniPLM
```

NOTE: You may need to setup the environments following the official repo of [Mamba](https://github.com/state-spaces/mamba) before runing the mamba experiments.

### 4.2 Baselines
#### Conventional Pre-Training
```bash
bash scripts/pretrain/qwen/200M.sh /PATH/TO/MiniPLM
bash scripts/pretrain/qwen/500M.sh /PATH/TO/MiniPLM
bash scripts/pretrain/qwen/1.2B.sh /PATH/TO/MiniPLM
```

#### Vanilla KD
```bash
bash scripts/vanilla_kd/qwen/200M.sh /PATH/TO/MiniPLM
bash scripts/vanilla_kd/qwen/500M.sh /PATH/TO/MiniPLM
bash scripts/vanilla_kd/qwen/1.2B.sh /PATH/TO/MiniPLM
```

#### SeqKD
```bash
bash scripts/seqkd/qwen/200M.sh /PATH/TO/MiniPLM
bash scripts/seqkd/qwen/500M.sh /PATH/TO/MiniPLM
bash scripts/seqkd/qwen/1.2B.sh /PATH/TO/MiniPLM
```

#### MiniLLM
We use the official codebase of MiniLLM for this baseline.


## 5 Evaluation
#### LM-Evaluation-Harness
```bash
bash scripts/eval/harness.sh /PATH/TO/MiniPLM --model-path /PATH/TO/TRAINED_CKPT --ckpt-name NAME_OF_CKPT
```
NOTE: The `story_cloze` dataset may require manually downloading. Please follow the instructions in this [link](https://huggingface.co/datasets/LSDSem/story_cloze/blob/734b4e1771508f38d8a05f034b48a42986446669/story_cloze.py#L50) to download the test sets. After downloading, you will need to replace the task configuration file `lm-evaluation-harness/tasks/storycloze/storycloze_2018.yaml` with `configs/lm_harness_tasks/storycloze_2018.yaml`that refers to the downloaded directory.

#### Language Modeling
```bash
bash scripts/eval/lm.sh /PATH/TO/MiniPLM --model-path /PATH/TO/TRAINED_CKPT --ckpt-name NAME_OF_CKPT
```