# Pretraining a Depth-Recurrent Model

This repo contains the code we used to train a recurrent-depth model at scale on 4096 AMD GPUs.

This repo is based on a fork of https://github.com/Lightning-AI/litgpt, which was very helpful for bootstrapping our efforts, but little `litgpt` code remains at this stage.

This repo also contains all code to prepare the tokenizer and data, mostly in `scripts/`. 

## Code Setup
* The actual model definition is in `recpre/model_dynamic.py`.
* The training is orchestrated from `train.py`.
* Model shapes can be found in `recpre/model_registry.py`. The final model is the shape `nebel-raven-3.5b`.
* The configurations for our two large-scale runs are in `launch_configs/`.
* The parallelism implementation is deep down in `recpre/utils.py`, in a class called `SimpleFabric`. Inter-node communication used `_allreduce_chunk_stream`, which, at the time of writing, was the only solution that remedied RCCL hangs at scale when using the OFI plugin.
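
For readers unfamiliar with the idea, here is a minimal single-process sketch of a chunked all-reduce. It is not the code in `recpre/utils.py`. It assumes, based only on the function name, that a flat buffer is split into fixed-size chunks and each chunk is reduced as its own operation. The function name `chunked_allreduce` and its parameters are hypothetical, and the "ranks" are plain Python lists rather than GPU tensors.

```python
from typing import List


def chunked_allreduce(rank_buffers: List[List[float]], chunk_size: int) -> List[float]:
    """Sum equally sized per-rank buffers one chunk at a time.

    Single-process simulation: each inner list stands in for the flat
    gradient buffer held by one rank (hypothetical setup).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    length = len(rank_buffers[0])
    if any(len(buf) != length for buf in rank_buffers):
        raise ValueError("all ranks must hold buffers of the same length")

    reduced: List[float] = []
    for start in range(0, length, chunk_size):
        stop = min(start + chunk_size, length)
        # In a real distributed run, this chunk would be one collective call,
        # so no single call has to move the whole buffer at once.
        chunk_sum = [sum(buf[i] for buf in rank_buffers) for i in range(start, stop)]
        reduced.extend(chunk_sum)
    return reduced
```

The result is the same as reducing the whole buffer in one call; only the granularity of the communication changes.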

The code to run the model at inference is probably easier to look at, if you just want to see the model architecture.
It can be found at `recpre/raven_modeling_minimal.py`.


## Reproducing Benchmark Scores

All benchmark scores reported in the paper are computed using the lm-eval harness, except for the code tasks, which are executed using bigcode. For default benchmarks, you can run `lm-eval` like so (no installation necessary):

```
lm_eval --model hf --model_args pretrained=MODEL_NAME,trust_remote_code=True,dtype=bfloat16,mean_recurrence=32 --tasks hellaswag --batch_size=auto --num_fewshot=0
```

For GSM8k, "w/ sys. prompt" refers to the following invocation, which uses this system prompt and chat formatting:
```
lm_eval --model hf \
--model_args pretrained=MODEL_NAME,trust_remote_code=True,dtype=bfloat16,mean_recurrence=32 \
--tasks gsm8k_cot --batch_size=auto --apply_chat_template=True --fewshot_as_multiturn \
--system_instruction="You are a helpful assistant that can assist users with mathematical reasoning."
```
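
The two invocations above differ only in a few flags, so a small helper can assemble either one as an argument list. This is a sketch under assumptions: the function `build_lm_eval_command` and its parameter names are hypothetical, and it only uses the flags already shown above. It builds the command but does not run it.

```python
import shlex
from typing import List, Optional


def build_lm_eval_command(
    model_name: str,
    task: str,
    mean_recurrence: int = 32,
    num_fewshot: Optional[int] = None,
    system_instruction: Optional[str] = None,
) -> List[str]:
    """Assemble an lm_eval argument list from the flags used in this README."""
    model_args = (
        f"pretrained={model_name},trust_remote_code=True,"
        f"dtype=bfloat16,mean_recurrence={mean_recurrence}"
    )
    cmd = ["lm_eval", "--model", "hf", "--model_args", model_args,
           "--tasks", task, "--batch_size=auto"]
    if num_fewshot is not None:
        cmd.append(f"--num_fewshot={num_fewshot}")
    if system_instruction is not None:
        # Chat formatting and the system prompt are enabled together,
        # mirroring the GSM8k "w/ sys. prompt" invocation.
        cmd += ["--apply_chat_template=True", "--fewshot_as_multiturn",
                f"--system_instruction={system_instruction}"]
    return cmd


if __name__ == "__main__":
    # Print a shell-safe version of the default benchmark command.
    print(shlex.join(build_lm_eval_command("MODEL_NAME", "hellaswag", num_fewshot=0)))
```

Passing the list to `subprocess.run` avoids shell quoting issues with the system prompt, since each flag is already a separate argument.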

## The grim details

What steps would you have to take if you were to replicate this model training and data collection run on an AMD cluster? Follow this outline:

1. Use `scripts/tokenizer_generation.py` to generate the tokenizer. Before you run the script, adapt all paths for your system. Data download is automatic. You also need the BPE trainer from https://github.com/gautierdag/bpeasy.
2. Run `scripts/scalable_data_download.py` to download all raw datasets. The name of the script is a lie: this is not so scalable. It will take a long time and lots of space, and it will fail due to random errors. You'll also notice that there are a number of extra rules hardcoded in the script for various badly formatted datasets. By the time you run this script, some of these authors may have updated their datasets, breaking assumptions made here. You would get an error in that case, and would need to investigate that particular dataset. After this step, you'd have all raw datasets in a `staging` folder.
3. Run `scripts/parquet_to_parquet_tokenizer.py` to generate the tokenized dataset. Again, remember to set your paths correctly.
4. After tokenizing, run `scripts/parquet_to_parquet_shuffler.py` to shuffle the data.
5. Define your own launch config in `launch_configs/` or use our config, and launch `train.py` onto your cluster. Follow your cluster's best practices and environment flag guidelines when setting up a large-scale run. The core command is just `python train.py --config=launch_configs/your_config.yaml`.
6. Watch it train (hopefully). You can add additional segments to the training run as needed.
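
The data preparation steps (1 to 4) can be sketched as a small driver that runs the scripts in order and retries the download step. This is only a sketch under assumptions: the names `DATA_STEPS`, `run_step`, and `run_data_pipeline` are hypothetical, and calling each script without arguments assumes you have already edited the paths inside the scripts. Retrying only helps with transient errors; a dataset whose format changed still needs manual investigation. Training itself is left out, since it should be launched following your cluster's own practices.

```python
import subprocess
from typing import Callable, List, Sequence

# Script names come from the outline above; running them without
# arguments is an assumption (paths are set inside each script).
DATA_STEPS: List[List[str]] = [
    ["python", "scripts/tokenizer_generation.py"],
    ["python", "scripts/scalable_data_download.py"],
    ["python", "scripts/parquet_to_parquet_tokenizer.py"],
    ["python", "scripts/parquet_to_parquet_shuffler.py"],
]


def run_step(
    cmd: Sequence[str],
    attempts: int = 1,
    runner: Callable[[Sequence[str]], int] = subprocess.call,
) -> int:
    """Run cmd up to `attempts` times and return the number of attempts used."""
    for attempt in range(1, attempts + 1):
        if runner(cmd) == 0:  # exit code 0 means success
            return attempt
    raise RuntimeError(f"step failed after {attempts} attempt(s): {' '.join(cmd)}")


def run_data_pipeline(
    runner: Callable[[Sequence[str]], int] = subprocess.call,
    download_attempts: int = 3,
) -> None:
    """Run all data steps in order, stopping at the first step that keeps failing."""
    for cmd in DATA_STEPS:
        is_download = "scalable_data_download" in cmd[1]
        run_step(cmd, attempts=download_attempts if is_download else 1, runner=runner)
```

The `runner` parameter exists so the ordering and retry logic can be checked with a stand-in function before anything is executed for real.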

