# Pretraining Tiny Language Models on Mixed Web Text and Python

To replicate the pretraining experiments described in our paper, first follow the instructions in `preparing_pt_data.md`.

## Launching Experiments
Once the data is prepared, set the pretraining dataset paths in the configuration files under `src/configs/pt_model_configs`.
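Before launching a long run, it can help to confirm that every configured dataset path actually exists. The snippet below is a minimal sketch of such a check, not part of this repository: `find_missing_paths` is a hypothetical helper, and the example entries are placeholders. It assumes you copy the list of local dataset paths out of your configuration file by hand (in upstream OLMo configs these typically sit under `data.paths`, which may differ here).

```python
from pathlib import Path
from typing import Iterable, List


def find_missing_paths(paths: Iterable[str]) -> List[str]:
    """Return the entries of `paths` that do not exist on the local filesystem."""
    return [p for p in paths if not Path(p).expanduser().exists()]


if __name__ == "__main__":
    # Placeholder entries (assumption): replace with the paths from your config.
    configured = [
        "data/tokenized/part-000.npy",
        "data/tokenized/part-001.npy",
    ]
    for missing in find_missing_paths(configured):
        print(f"missing: {missing}")
```

The check only covers local files; remote URLs (for example `s3://` paths) would need a different existence test.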

Finally, launch experiments using the pretraining code implementation provided by the open-source OLMo project (included at `external/OLMo`) as shown below.

```
# Pretraining a 150M parameter model on a single node with 4 GPUs
torchrun --nproc_per_node=4 external/OLMo/scripts/train.py src/configs/pt_model_configs/tiny_code_lm_150M.yaml --save_overwrite
```

```
# Pretraining a 400M parameter model on a single node with 4 GPUs
torchrun --nproc_per_node=4 external/OLMo/scripts/train.py src/configs/pt_model_configs/tiny_code_lm_400M.yaml --save_overwrite
```

## Converting Saved Model Checkpoints to HuggingFace-Compatible Form
Once pretraining is complete, saved model checkpoints can be converted into HuggingFace-compatible form by running the following commands:

```
python external/OLMo/scripts/convert_olmo_to_hf_new.py --input_dir external/OLMo/runs/tiny_code_lm_150M_fw_10.4BT_pyt_60TB/$FINAL_CHECKPOINT_DIR --tokenizer_json_path external/OLMo/tokenizers/allenai_gpt-neox-olmo-dolma-v1_5.json --output external/OLMo/runs/tiny_code_lm_150M_fw_10.4BT_pyt_60TB/final-hf
```

```
python external/OLMo/scripts/convert_olmo_to_hf_new.py --input_dir external/OLMo/runs/tiny_code_lm_400M_fw_10.4BT_pyt_60TB/$FINAL_CHECKPOINT_DIR --tokenizer_json_path external/OLMo/tokenizers/allenai_gpt-neox-olmo-dolma-v1_5.json --output external/OLMo/runs/tiny_code_lm_400M_fw_10.4BT_pyt_60TB/final-hf
```
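The commands above expect `$FINAL_CHECKPOINT_DIR` to hold the name of the checkpoint directory to convert. As a rough sketch, the helper below picks the highest-step checkpoint inside a run directory. It assumes checkpoint directories are named `step<N>` or `step<N>-unsharded`, which matches upstream OLMo's usual convention but should be verified against your own run output; `latest_checkpoint_dir` and its `require_unsharded` parameter are hypothetical names introduced for illustration.

```python
import re
import sys
from pathlib import Path
from typing import Optional

# Assumed naming convention: "step<N>" or "step<N>-unsharded".
_STEP_PATTERN = re.compile(r"^step(\d+)(-unsharded)?$")


def latest_checkpoint_dir(run_dir: str, require_unsharded: bool = False) -> Optional[str]:
    """Return the name of the highest-step checkpoint directory in `run_dir`.

    Steps are compared numerically, so "step1000" ranks above "step200".
    If a sharded and an unsharded checkpoint share the same step, the
    unsharded one is preferred. Returns None when nothing matches.
    """
    candidates = []
    for entry in Path(run_dir).iterdir():
        match = _STEP_PATTERN.match(entry.name)
        if match is None or not entry.is_dir():
            continue
        is_unsharded = match.group(2) is not None
        if require_unsharded and not is_unsharded:
            continue
        candidates.append((int(match.group(1)), is_unsharded, entry.name))
    return max(candidates)[2] if candidates else None


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python this_script.py <run directory>
    print(latest_checkpoint_dir(sys.argv[1]))
```

The printed name can then be exported as `FINAL_CHECKPOINT_DIR` before running the conversion script. If the conversion script only accepts unsharded checkpoints, pass `require_unsharded=True`.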

## Acknowledgements
We are grateful to the OLMo project for open-sourcing their pretraining code.

```
@article{groeneveld2024olmo,
  title={{OLMo}: Accelerating the Science of Language Models},
  author={Groeneveld, Dirk and Beltagy, Iz and Walsh, Pete and Bhagia, Akshita and Kinney, Rodney and Tafjord, Oyvind and Jha, Ananya Harsh and Ivison, Hamish and Magnusson, Ian and Wang, Yizhong and others},
  journal={arXiv preprint arXiv:2402.00838},
  year={2024}
}
```