# reproduce-physics-of-language-models
A reproduction of the paper *Physics of Language Models*.

## Generate Data

The data is split by index range:

| Data | 0k-50k | 50k-60k | 60k-100k | 100k-110k |
| --- | --- | --- | --- | --- |
| Bios | Pretraining | Pretraining | Pretraining | NA |
| QAs | SFT | Mixed into pretraining | Testing | Unknown |

- Pretraining: bios (97%) plus QAs (3%, from the 50k-60k range)
- SFT: QAs from the 0k-50k range
- SFT with unknown data: QAs from the 100k-110k range are mixed into the SFT data
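
The split table can be read as a lookup from an index to the role of its bio and QA data. The sketch below is illustrative only: `SPLITS` and `split_roles` are hypothetical names that do not come from this repository, and the ranges are assumed to be half-open (for example, 0k-50k covers indices 0 through 49,999).

```python
from typing import Optional, Tuple

# (start, end, bio role, QA role); ranges assumed half-open: start <= index < end.
SPLITS = [
    (0, 50_000, "pretraining", "sft"),
    (50_000, 60_000, "pretraining", "mixed into pretraining"),
    (60_000, 100_000, "pretraining", "testing"),
    (100_000, 110_000, None, "unknown"),  # no bios for this range (NA in the table)
]


def split_roles(index: int) -> Tuple[Optional[str], str]:
    """Return (bio_role, qa_role) for an index, following the table above."""
    for start, end, bio_role, qa_role in SPLITS:
        if start <= index < end:
            return bio_role, qa_role
    raise ValueError(f"index {index} is outside the 0-110k range")


if __name__ == "__main__":
    for i in (0, 55_000, 99_999, 105_000):
        print(i, split_roles(i))
```

Under these assumptions, an index in the 100k-110k range has no bio in pretraining, so its QAs are unseen facts when they are mixed into SFT.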

```bash
# generate the data described above
python generate_bios.py
# output goes to RESULT_PATH, set in generate_bios.py
# change 10000 to the target size

# generate binary files
python convert_binary.py -i {input_folder} -o {output_folder} --val_shard_size 1000000
```

## Training

### Pretraining

Pretrain on the mixed data (bios plus QAs):

```bash
torchrun --standalone --nproc_per_node=1 train_gpt2.py \
  --input_folder hallucinate_small/pretrain_perturbed_mixed \
  --output_dir temp_log \
  --run_name xs_pretrain_small \
  --model_size small \
  --sequence_length 512 \
  --batch_size 32 \
  --device_batch_size 32 \
  --num_epochs 4 \
  --learning_rate 0.0003 \
  --weight_decay 0.1 \
  --warmup_ratio 0.05 \
  --warmdown_ratio 0.9 \
  --save_every 2000 \
  --val_loss_every 2000 \
  --bf16
```
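
As a rough check on the flags in the command above, the arithmetic below works out the gradient-accumulation steps and the tokens per optimizer step. It assumes `--batch_size` is the global batch size in sequences and `--device_batch_size` is the per-GPU micro-batch, which is a common convention but is not confirmed by this README. `batch_plan` is a hypothetical helper, not part of `train_gpt2.py`.

```python
def batch_plan(batch_size, device_batch_size, num_devices, sequence_length):
    """Return (grad_accum_steps, tokens_per_step) under the assumptions above."""
    sequences_per_pass = device_batch_size * num_devices
    if batch_size % sequences_per_pass != 0:
        raise ValueError("batch_size must be a multiple of device_batch_size * num_devices")
    grad_accum_steps = batch_size // sequences_per_pass
    tokens_per_step = batch_size * sequence_length
    return grad_accum_steps, tokens_per_step


# Values taken from the pretraining command: one GPU, batch 32, 512-token sequences.
accum_steps, tokens_per_step = batch_plan(
    batch_size=32, device_batch_size=32, num_devices=1, sequence_length=512
)
print(accum_steps, tokens_per_step)  # 1 16384
```

If that reading of the flags holds, each optimizer step processes 16,384 tokens with no gradient accumulation.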

## Data Source
- Company data: https://www.kaggle.com/datasets/rm1000/fortune-500-companies?resource=download
- College data: https://www.kaggle.com/datasets/yashgpt/us-college-data