# Data Pruning for SFT

## Environment Setup
```
conda env create -f eval.yml
conda activate eval
pip install lighteval@git+https://github.com/huggingface/lighteval.git@ed084813e0bd12d82a06d9f913291fdbee774905
pip install "lighteval[math]"
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
pip install vllm==0.7.2
pip install deepspeed
pip install trl@git+https://github.com/huggingface/trl.git@69ad852e5654a77f1695eb4c608906fe0c7e8624
pip install liger-kernel==0.5.3
```

## Full-dataset distillation from DeepSeek-R1 (reasoning data only)
```
accelerate launch --config_file recipes/accelerate_configs/zero3.yaml src/open_r1/sft.py --config recipes/Qwen2.5-1.5B-Instruct/sft/config_demo.yaml
```

## Tail-only dataset distillation from DeepSeek-R1 (four GPUs)
```
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch --config_file recipes/accelerate_configs/zero3_four.yaml src/open_r1/sft.py --config recipes/Qwen2.5-Math-7B/config_tail.yaml
```

## Edge-only dataset distillation from DeepSeek-R1 (four GPUs)
```
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch --config_file recipes/accelerate_configs/zero3_four.yaml src/open_r1/sft.py --config recipes/Qwen2.5-Math-7B/config_edge.yaml
```
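
The four-GPU commands above pin devices through `CUDA_VISIBLE_DEVICES`. As an optional sanity check before launching, the sketch below counts the indices listed in that variable and warns on a mismatch. The `count_devices` helper is hypothetical (not part of this repository), and the expected count of 4 is an assumption based on the `zero3_four.yaml` file name rather than its contents.

```shell
# Hypothetical helper (not part of this repository): count the GPU indices
# in a CUDA_VISIBLE_DEVICES-style, comma-separated string.
count_devices() {
  if [ -z "$1" ]; then
    echo 0
    return
  fi
  # Split on commas and count the non-empty entries.
  printf '%s\n' "$1" | tr ',' '\n' | grep -c . || true
}

# Assumption: zero3_four.yaml expects four processes, one per GPU.
expected=4
visible=$(count_devices "${CUDA_VISIBLE_DEVICES:-}")
if [ "$visible" -ne "$expected" ]; then
  echo "warning: $visible GPU(s) listed in CUDA_VISIBLE_DEVICES, expected $expected" >&2
fi
```

The check only takes effect if `CUDA_VISIBLE_DEVICES` is exported in the shell rather than set inline on the launch command. When the variable is unset, the check reports 0 even though all GPUs are then visible to the process, so treat the warning as a prompt to double-check rather than a hard failure.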


