

# SpectrumKD: Dynamic Dataset Curation for Distribution-Aware Knowledge Distillation of Large Language Models
## 1 Requirements 
```bash 
pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cu121 
pip install numpy==1.24.0 
pip install transformers==4.51.1 
pip install vllm==0.5.0 
pip install deepspeed==0.16.5 
pip install nltk==3.9.1 
pip install numerize==0.12 
pip install rouge-score==0.1.2 
pip install torchtyping==0.1.5 
pip install rich==14.0.0 
pip install accelerate==1.2.1 
pip install datasets==3.2.0 
pip install sentencepiece 
pip install protobuf==4.23.4 
pip install peft==0.14.0 
``` 
Alternatively, run the install script:
```bash 
bash install.sh 
``` 
Download the pretrained model checkpoints and place them in the `checkpoints/` folder before running the training or evaluation scripts.
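
Since every later step depends on the checkpoints being in place, a small pre-flight check can save a failed launch. The following is a minimal sketch; `check_checkpoints` is a hypothetical helper, not something shipped with the repository, and it only tests that the folder exists and is non-empty.

```bash
#!/usr/bin/env bash
# Sketch: fail early when the checkpoints/ folder is missing or empty.
# check_checkpoints is an illustrative helper, not part of the repository.
check_checkpoints() {
  local dir="${1:-checkpoints}"
  if [ ! -d "$dir" ]; then
    echo "missing directory: $dir" >&2
    return 1
  fi
  # ls -A also lists hidden entries; empty output means the folder is empty.
  if [ -z "$(ls -A "$dir")" ]; then
    echo "no checkpoints found in: $dir" >&2
    return 1
  fi
  echo "found checkpoints in: $dir"
}
```

For example, `check_checkpoints checkpoints && bash scripts/gpt2/eval/run_eval.sh /PATH_TO/spectrumKD` would run the evaluation only when the folder is populated.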
## 2 Data Processing 
The raw datasets can be downloaded with the following commands:
```bash 
huggingface-cli download MiniLLM/dolly --repo-type dataset --local-dir /PATH_TO/LMOps/spectrumKD/data/dolly/
huggingface-cli download MiniLLM/self-inst --repo-type dataset --local-dir /PATH_TO/LMOps/spectrumKD/data/self-inst/
huggingface-cli download MiniLLM/Vicuna --repo-type dataset --local-dir /PATH_TO/LMOps/spectrumKD/data/vicuna/
huggingface-cli download MiniLLM/sinst --repo-type dataset --local-dir /PATH_TO/LMOps/spectrumKD/data/sinst/
huggingface-cli download MiniLLM/uinst --repo-type dataset --local-dir /PATH_TO/LMOps/spectrumKD/data/uinst/
huggingface-cli download openai/gsm8k --repo-type dataset --local-dir /PATH_TO/LMOps/spectrumKD/data/gsm8k/
huggingface-cli download Muennighoff/mbpp --repo-type dataset --local-dir /PATH_TO/LMOps/spectrumKD/data/mbpp/
huggingface-cli download IWSLT/iwslt2017 --repo-type dataset --local-dir /PATH_TO/LMOps/spectrumKD/data/iwslt2017/
huggingface-cli download EdinburghNLP/xsum --repo-type dataset --local-dir /PATH_TO/LMOps/spectrumKD/data/xsum/
``` 
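
The download commands differ only in the repository ID and the target subdirectory, so they can be generated from a list. The sketch below prints one command per dataset; `DATA_ROOT`, `DATASETS`, and `print_download_cmds` are illustrative names that are not part of the repository, and the target folder is passed through the CLI's `--local-dir` option.

```bash
#!/usr/bin/env bash
# Sketch: generate one download command per dataset.
# DATA_ROOT, DATASETS and print_download_cmds are illustrative names.
DATA_ROOT="${DATA_ROOT:-/PATH_TO/LMOps/spectrumKD/data}"

# "repo_id:target_subdir" pairs, taken from the dataset list in this section.
DATASETS=(
  "MiniLLM/dolly:dolly"
  "MiniLLM/self-inst:self-inst"
  "MiniLLM/Vicuna:vicuna"
  "MiniLLM/sinst:sinst"
  "MiniLLM/uinst:uinst"
  "openai/gsm8k:gsm8k"
  "Muennighoff/mbpp:mbpp"
  "IWSLT/iwslt2017:iwslt2017"
  "EdinburghNLP/xsum:xsum"
)

print_download_cmds() {
  local entry repo dir
  for entry in "${DATASETS[@]}"; do
    repo="${entry%%:*}"   # text before the colon
    dir="${entry##*:}"    # text after the colon
    echo "huggingface-cli download ${repo} --repo-type dataset --local-dir ${DATA_ROOT}/${dir}/"
  done
}

print_download_cmds
```

The function only prints the commands, which makes it easy to review them first; piping the output into `bash` would execute them.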
Preprocess the datasets before running the training or evaluation scripts. The processing scripts follow this pattern, shown here for Dolly:
```bash 
bash scripts/gpt2/tools/process_data_dolly_spectrumKD.sh 
bash scripts/openllama2/tools/process_data_dolly_spectrumKD.sh 
``` 
These scripts write the processed data to the appropriate directories under `processed_data/`.
## 3 Training 
We provide example commands for GPT-2 models. All our experiments were conducted on 4 × A800-80G GPUs; fewer GPUs are enough for smaller models.
### 3.1 Baselines 
#### Fine-tune the teacher models 
```bash 
bash scripts/gpt2/sft/sft_xlarge.sh /PATH_TO/spectrumKD 
``` 
#### SFT Baseline 
```bash 
bash scripts/gpt2/sft/sft_base.sh /PATH_TO/spectrumKD 
``` 
#### SeqKD Baseline 
```bash 
bash scripts/gpt2/seqkd/seqkd_base.sh /PATH_TO/spectrumKD 
``` 
#### GKD Baseline 
```bash 
bash scripts/gpt2/gkd/gkd_base.sh /PATH_TO/spectrumKD 
``` 
#### KD-series Baselines
```bash 
bash scripts/gpt2/kd/kd_base.sh --type kd 
bash scripts/gpt2/kd/kd_base.sh --type rkl 
bash scripts/gpt2/kd/kd_base.sh --type jsd 
bash scripts/gpt2/kd/kd_base.sh --type tvd 
bash scripts/gpt2/kd/kd_base.sh --type sfkl 
bash scripts/gpt2/kd/kd_base.sh --type srkl 
``` 
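
The six KD-series runs differ only in the `--type` value, so they can be launched from a loop. This is a minimal sketch, assuming the script is invoked exactly as above; `KD_TYPES`, `DRY_RUN`, and `run_kd_baselines` are hypothetical names introduced for illustration.

```bash
#!/usr/bin/env bash
# Sketch: run every KD-series baseline in sequence.
# KD_TYPES, DRY_RUN and run_kd_baselines are illustrative, not part of the repository.
KD_TYPES=(kd rkl jsd tvd sfkl srkl)

run_kd_baselines() {
  local t
  for t in "${KD_TYPES[@]}"; do
    local cmd=(bash scripts/gpt2/kd/kd_base.sh --type "$t")
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "${cmd[*]}"          # dry run: only print the command
    else
      "${cmd[@]}" || return 1   # stop at the first failing run
    fi
  done
}

run_kd_baselines
```

By default the function only prints the commands; calling `DRY_RUN=0 run_kd_baselines` from the repository root would start the actual training runs one after another.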
### 3.2 SpectrumKD
```bash 
bash scripts/gpt2/spectrumKD/spectrumKD.sh --type tfkl 
bash scripts/gpt2/spectrumKD/spectrumKD.sh --type trkl 
bash scripts/gpt2/spectrumKD/spectrumKD.sh --type tjsd 
bash scripts/gpt2/spectrumKD/spectrumKD.sh --type ttvdf 
bash scripts/gpt2/spectrumKD/spectrumKD.sh --type tsfkl 
bash scripts/gpt2/spectrumKD/spectrumKD.sh --type tsrkl 
``` 
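
When comparing the variants, it may help to keep a separate log for each run. The sketch below shows one possible way to do that; `LOG_DIR`, `RUNNER`, and `run_spectrumkd_variants` are assumptions made for this example, and the `--type` values are copied verbatim from the commands above.

```bash
#!/usr/bin/env bash
# Sketch: run each SpectrumKD variant and keep one log file per variant.
# LOG_DIR, RUNNER and run_spectrumkd_variants are illustrative names.
SPECTRUM_TYPES=(tfkl trkl tjsd ttvdf tsfkl tsrkl)
LOG_DIR="${LOG_DIR:-logs/spectrumKD}"

run_spectrumkd_variants() {
  # RUNNER defaults to "echo", so commands are only printed; set RUNNER=bash to train.
  local runner="${RUNNER:-echo}" t
  mkdir -p "$LOG_DIR"
  for t in "${SPECTRUM_TYPES[@]}"; do
    # tee shows the output on screen and writes it to <LOG_DIR>/<type>.log.
    # Note: without `set -o pipefail`, the pipeline reports tee's exit status.
    "$runner" scripts/gpt2/spectrumKD/spectrumKD.sh --type "$t" 2>&1 | tee "$LOG_DIR/$t.log"
  done
}
```

Calling `RUNNER=bash run_spectrumkd_variants` from the repository root would launch the six runs in sequence and leave files such as `logs/spectrumKD/tfkl.log` behind.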
## 4 Run Evaluation 
```bash 
bash scripts/gpt2/eval/run_eval.sh /PATH_TO/spectrumKD 
``` 
