# Chain-of-Learngene: A Scalable Learngene-based Paradigm for Building and Initializing Variable-Sized Language Models

## 1. Note

The code is based on the official [MiniLLM](https://github.com/microsoft/LMOps/tree/main/minillm) code. Before setting up the CoL-specific environment, first follow the MiniLLM instructions to install its dependencies and download the pre-processed datasets.

## 2. CoL Setup

```
pip3 install mpu
pip3 install accelerate==0.34.2
pip3 install torchtyping
pip3 install transformers
pip3 install deepspeed==0.15.0
pip3 install tokenizers==0.14.1
pip install --upgrade --force-reinstall certifi
pip install --upgrade datasets huggingface_hub
pip install torchtyping rouge_score
pip install --upgrade transformers tokenizers
pip3 install --no-cache-dir -e /opt/dpcvol/models/pkge/transformers-minillm/. 
pip3 install thop
pip3 install pytorch_model_summary

pip3 uninstall py-cpuinfo -y
pip3 install py-cpuinfo
```
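
Because some of the commands above pin versions while later ones upgrade packages, it can help to check what actually ended up installed. The following is a minimal sketch, not part of the repository: the helper name `check_environment` and the `REQUIREMENTS` dictionary are illustrative assumptions, and only the two pins that are not overridden by a later `--upgrade` command are checked for an exact version.

```python
from importlib import metadata


def check_environment(requirements, lookup=metadata.version):
    """Return a list of problems; an empty list means every check passed.

    `requirements` maps a distribution name to an exact version string,
    or to None when the package only needs to be present.
    """
    problems = []
    for name, wanted in requirements.items():
        try:
            found = lookup(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name}: not installed")
            continue
        if wanted is not None and found != wanted:
            problems.append(f"{name}: found {found}, expected {wanted}")
    return problems


# Assumption: only accelerate and deepspeed keep their pins, because
# tokenizers and transformers are upgraded again later in the setup block.
REQUIREMENTS = {
    "accelerate": "0.34.2",
    "deepspeed": "0.15.0",
    "transformers": None,
    "tokenizers": None,
    "datasets": None,
    "thop": None,
}

if __name__ == "__main__":
    for line in check_environment(REQUIREMENTS) or ["environment looks consistent"]:
        print(line)
```

Running the script after installation prints one line per missing or mismatched package.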

## 3. Downloading and Pre-processing Dataset
### 3.1 Pre-training Dataset
* This paper uses two pre-training datasets (OpenWebText and OpenWebText-100K). You can download the pre-processed OpenWebText with the following command:
    ```
    huggingface-cli download MiniLLM/openwebtext-processed --repo-type dataset --local-dir /PATH_TO/LMOps/minillm/processed_data/openwebtext/gpt2/512/10M/ # Optional
    ```
* Unprocessed OpenWebText-100K can be downloaded via:
    ```
    huggingface-cli download --repo-type dataset --resume-download Elriggs/openwebtext-100k --local-dir ./
    ```
    You can optionally pre-tokenize OpenWebText-100K with the following command:
    ```
    bash scripts/gpt2/tools/process_data_pretrain.sh
    ```
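
The `gpt2/512` components of the processed-data path above suggest GPT-2 tokenization with 512-token sequences. As a rough illustration of what pre-tokenization typically involves (not the actual logic of `process_data_pretrain.sh`, which is not reproduced here), the sketch below packs already-tokenized documents into fixed-length blocks. The function name `pack_token_ids`, the default `max_length=512`, and the choice to drop the trailing remainder are assumptions; 50256 is the end-of-text ID in the GPT-2 vocabulary.

```python
from typing import Iterable, Iterator, List


def pack_token_ids(
    docs: Iterable[List[int]],
    max_length: int = 512,
    eos_id: int = 50256,
) -> Iterator[List[int]]:
    """Concatenate tokenized documents and yield blocks of `max_length` IDs.

    An end-of-text ID is appended after every document. Tokens left over at
    the end that do not fill a whole block are dropped (an assumption of
    this sketch).
    """
    buffer: List[int] = []
    for ids in docs:
        buffer.extend(ids)
        buffer.append(eos_id)
        # Emit as many full blocks as the buffer currently holds.
        while len(buffer) >= max_length:
            yield buffer[:max_length]
            buffer = buffer[max_length:]


# Toy usage with a tiny block size so the behavior is easy to follow.
blocks = list(pack_token_ids([[1, 2, 3], [4, 5]], max_length=2, eos_id=0))
```

Packing avoids padding, so every training sequence is fully used.
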
### 3.2 SFT Dataset
* For SFT tasks, you can download your own dataset by running the following bash script. In the script, replace the dataset name and path with your own.
    ```
    bash download_dataset.sh
    ```
* After downloading the dataset, you can pre-process the SFT dataset with the following command:
    ```
    bash scripts/gpt2/tools/process_data_dolly.sh BASE_PATH # Process Dolly Train / Validation Data
    ```
    Please note that `process_data_dolly.py` is only an example of pre-processing an SFT dataset; you should change the variable `template` for your specific task.
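
To make the idea of a task-specific template concrete, here is a hypothetical sketch for BoolQ, whose records carry `passage`, `question`, and a boolean `answer`. The template wording, the name `BOOLQ_TEMPLATE`, the helper `build_example`, and the output keys `prompt` and `output` are assumptions for illustration; they are not taken from `process_data_dolly.py`.

```python
# Hypothetical prompt template; adapt the wording to your own task.
BOOLQ_TEMPLATE = "Passage: {passage}\nQuestion: {question}?\nAnswer (yes or no):"


def build_example(record, template=BOOLQ_TEMPLATE):
    """Turn one BoolQ-style record into a prompt/response pair."""
    prompt = template.format(
        passage=record["passage"].strip(),
        question=record["question"].strip(),
    )
    # The boolean label becomes the text the model is trained to produce.
    response = "yes" if record["answer"] else "no"
    return {"prompt": prompt, "output": response}


example = build_example(
    {
        "passage": "Water boils at 100 degrees Celsius at sea level.",
        "question": "does water boil at 100 degrees celsius at sea level",
        "answer": True,
    }
)
```

Keeping the template in a single variable means switching tasks only requires editing one string and the field names passed to `format`.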

## 4. Train and Evaluate

### 4.1 Generating Initialization Parameters with LInit
* To generate parameters for DesNets from CoL, you can use the following bash file.
    ```
    bash scripts/gpt2/learngene/initialize_desnet.sh  # for LLaMA-3 or Qwen3, use the corresponding script under scripts/llama3/ or scripts/qwen3/
    ```
* In your bash file, you should change the following variables:

    | Variable | Function |
    | ---- | ---- |
    | big_model_path | The model path of the larger checkpoint in the learngene chain. |
    | small_model_path | The model path of the smaller checkpoint in the learngene chain. |
    | middle_model_path | The config path of the DesNet. Here, you can customize the number of layers, number of heads, head dimension, and intermediate dimension. |
    | output_dir | The path where the DesNet parameters are saved. |
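
When customizing the DesNet config, it is useful to know roughly how many parameters a given choice of layers, heads, head dimension, and intermediate dimension produces. The sketch below counts parameters for a GPT-2-style decoder with tied input/output embeddings. The function name `gpt2_style_param_count` and its defaults are assumptions for illustration; LLaMA-3 and Qwen3 use different block layouts, so the formula does not carry over to them unchanged.

```python
def gpt2_style_param_count(
    n_layer,
    n_head,
    head_dim,
    intermediate_dim,
    hidden_dim=768,
    vocab_size=50257,
    n_positions=1024,
):
    """Estimate the parameter count of a GPT-2-style decoder.

    Assumes learned position embeddings, biases on all linear layers, two
    LayerNorms per block plus a final LayerNorm, and an output head tied to
    the token embedding (so it adds no parameters).
    """
    inner = n_head * head_dim  # total width of the attention heads
    embeddings = vocab_size * hidden_dim + n_positions * hidden_dim
    # Fused QKV projection plus the output projection, each with a bias.
    attention = hidden_dim * 3 * inner + 3 * inner + inner * hidden_dim + hidden_dim
    # Up- and down-projection of the feed-forward block, each with a bias.
    mlp = (
        hidden_dim * intermediate_dim + intermediate_dim
        + intermediate_dim * hidden_dim + hidden_dim
    )
    layer_norms = 2 * 2 * hidden_dim  # weight and bias for two LayerNorms
    final_norm = 2 * hidden_dim
    return embeddings + n_layer * (attention + mlp + layer_norms) + final_norm


# Sanity check against the standard GPT-2 small configuration.
gpt2_small = gpt2_style_param_count(n_layer=12, n_head=12, head_dim=64, intermediate_dim=3072)
print(f"{gpt2_small / 1e6:.1f}M parameters")
```

Varying one argument at a time shows how much each dimension contributes before you commit to a config.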

### 4.2 Pre-train with/without LInit
* After generating the initialization parameters, you can continue pre-training DesNet from them using the following example script:
	```
	bash scripts/gpt2/learngene/pretrain_hf/pretrain_hf_138M_78M-LInit.sh  # for LLaMA-3 or Qwen3, use the corresponding script under scripts/llama3/ or scripts/qwen3/
	```
	Note that '138M' refers to the number of DesNet parameters, and '78M' refers to the size of the pre-training corpus. You can also customize these by adding a new bash file.
* If you are pre-training DesNet from scratch, you can use the following script:
	```
	bash scripts/gpt2/learngene/pretrain_hf/pretrain_hf_138M_78M.sh  # for LLaMA-3 or Qwen3, use the corresponding script under scripts/llama3/ or scripts/qwen3/
	```
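
If you add your own pre-training scripts, following the same naming pattern keeps them easy to tell apart. The sketch below decodes a script name into its parts. The helper `describe_pretrain_script` and the regular expression are assumptions inferred from the two example file names above, not a convention enforced by the repository.

```python
import re

# Pattern inferred from the example names: model size, corpus size, and an
# optional "-LInit" suffix marking initialization from generated parameters.
SCRIPT_NAME = re.compile(
    r"^pretrain_hf_(?P<model>\d+[MB])_(?P<corpus>\d+[MB])(?P<linit>-LInit)?\.sh$"
)


def describe_pretrain_script(filename):
    """Split a pre-training script name into its components."""
    match = SCRIPT_NAME.match(filename)
    if match is None:
        raise ValueError(f"unrecognized script name: {filename}")
    return {
        "desnet_params": match["model"],
        "corpus_size": match["corpus"],
        "uses_linit": match["linit"] is not None,
    }


info = describe_pretrain_script("pretrain_hf_138M_78M-LInit.sh")
```
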
	
### 4.3 SFT on the Downstream Datasets
* Here, we provide the commands for fine-tuning DesNet.
	```
	bash scripts/gpt2/sft/sft_desnet_boolq_gpt2-138M.sh  # for LLaMA-3 or Qwen3, use the corresponding script under scripts/llama3/ or scripts/qwen3/
	```
	'gpt2-138M' indicates the number of parameters of DesNet, and 'boolq' indicates the downstream task.
### 4.4 Evaluate the Fine-tuned DesNets
* After fine-tuning DesNet, you can evaluate it by running the bash scripts in the following folder: `scripts/gpt2/eval/` (or the corresponding folder under `scripts/llama3/` or `scripts/qwen3/`).

## 5. Citation
```
@inproceedings{minillm,
  title={MiniLLM: Knowledge Distillation of Large Language Models},
  author={Gu, Yuxian and Dong, Li and Wei, Furu and Huang, Minlie},
  booktitle={Proceedings of ICLR},
  year={2024}
}
```







