## Quick Start 

This project is based on the [TinyLlama](https://github.com/jzhang38/TinyLlama) project. It has been adapted to support pretraining with context window scheduling, intra-document masking, etc.

### Installation
If you already have an environment built for [TinyLlama](https://github.com/jzhang38/TinyLlama/blob/main/PRETRAIN.md), you can use it directly.
Otherwise, use the following commands to build a new environment.
These commands assume CUDA 11.8:
```bash
conda create -n ladder-pretrain python=3.8
conda activate ladder-pretrain
# install the latest compatible version of torch and xformers, this should install torch 2.4.1
pip install ninja
pip3 install -U xformers --index-url https://download.pytorch.org/whl/cu118 
# install flash attention
git clone --branch v2.3.3 --depth 1 https://github.com/Dao-AILab/flash-attention.git
cd flash-attention
python setup.py install
cd csrc/rotary && pip install .
cd ../layer_norm && pip install .
cd ../xentropy && pip install .
cd ../../.. && rm -rf flash-attention # return to the directory that contains the clone before deleting it
# install other dependencies 
pip install -r requirements.txt
```

### Data preparation
Data preparation follows the original TinyLlama project.
First, make sure your data is in JSONL format and organized in a single directory with the following structure:
```
TEXT_DIR
├── cc
│   ├── train
│   │   ├── 0.jsonl
│   │   ├── 1.jsonl
│   │   └── ...
│   └── valid
│       ├── 0.jsonl
│       ├── 1.jsonl
│       └── ...
└── ...
```
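
Before running the processing script, it can help to confirm that the directory matches this layout. The following is a minimal sketch, not part of this repository: `check_text_dir` is a hypothetical helper that only checks for `train` and `valid` folders containing `.jsonl` files and parses the first line of each file. It makes no assumption about which JSON fields the processing script expects.

```python
import json
from pathlib import Path


def check_text_dir(text_dir):
    """Return a list of human-readable problems found under text_dir."""
    root = Path(text_dir)
    if not root.is_dir():
        return [f"{root} is not a directory"]

    problems = []
    datasets = sorted(p for p in root.iterdir() if p.is_dir())
    if not datasets:
        problems.append(f"no dataset directories under {root}")

    for dataset in datasets:
        for split in ("train", "valid"):
            split_dir = dataset / split
            files = sorted(split_dir.glob("*.jsonl")) if split_dir.is_dir() else []
            if not files:
                problems.append(f"{dataset.name}/{split}: no .jsonl files")
                continue
            for path in files:
                # Only the first line is parsed, to keep the check cheap.
                with path.open(encoding="utf-8") as handle:
                    first_line = handle.readline()
                try:
                    json.loads(first_line)
                except json.JSONDecodeError:
                    problems.append(
                        f"{dataset.name}/{split}/{path.name}: first line is not valid JSON"
                    )
    return problems
```

Calling `check_text_dir` on your `TEXT_DIR` returns an empty list when every dataset has both splits and each file starts with a parseable JSON line.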

Then run the following:

```bash
export TEXT_DIR=<YOUR_TEXT_DIR>
export BINS_ROOT=<YOUR_BIN_DIR> # where to store the processed chunks
bash scripts/pajama_processing.sh cc 8k
```
where `cc` is the dataset name and `8k` is the sequence length (lengths from 512 to 16k are supported).
`TEXT_DIR` is the directory where the raw text data is stored, and `BINS_ROOT` is the directory where the processed data will be written.
After this step, you will have the data in the following structure:
```
BINS_ROOT
├── cc_8k
│   ├── train_0.bin
│   ├── train_1.bin
│   ├── ...
│   ├── valid_0.bin
│   ├── valid_1.bin
│   └── ...
└── ...
```

### Pretraining
Next, you can start pretraining by running the following:
```bash
export WANDB_API_KEY=<YOUR_WANDB_API_KEY> # if you want to log into wandb
export BINS_ROOT=<YOUR_BIN_DIR> # from the previous data preparation step
bash scripts/pretraining.sh tiny_LLaMA_1b_8k cc_8k cc_8k # replace 1b_8k with 120M_8k or 360M_8k for smaller models
```
The general usage of `pretraining.sh` is `bash scripts/pretraining.sh model_config_name train_dataset_name eval_dataset_name`.
In the command above, `tiny_LLaMA_1b_8k` is the model config name, and `cc_8k` is used as both the training and the evaluation dataset name.
The script will look for bins created in the previous step. Those with a `train_*` prefix are used for training and those with a `valid_*` prefix are used for evaluation.
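
To see which bins a given dataset name would pick up, the prefix convention can be sketched as follows. This is an illustrative helper only: `split_bins` is a hypothetical name, and the actual training script may discover and order the files differently.

```python
from pathlib import Path


def split_bins(bins_root, dataset_name):
    """Group the .bin files of one dataset by their train_/valid_ prefix."""
    dataset_dir = Path(bins_root) / dataset_name
    return {
        "train": sorted(p.name for p in dataset_dir.glob("train_*.bin")),
        "valid": sorted(p.name for p in dataset_dir.glob("valid_*.bin")),
    }
```

For example, `split_bins(BINS_ROOT, "cc_8k")` lists the files under `BINS_ROOT/cc_8k`. An empty `train` or `valid` list suggests that the data preparation step did not produce that split.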

You can simply replace the model config name to get different models:
```bash
bash scripts/pretraining.sh tiny_LLaMA_1b_8k cc_8k cc_8k # baseline with standard causal attention
bash scripts/pretraining.sh tiny_LLaMA_1b_8k_intramask cc_8k cc_8k # intradocument masking
bash scripts/pretraining.sh tiny_LLaMA_1b_8k_dm8 cc_8k cc_8k # skyladder with alpha=1/8 
bash scripts/pretraining.sh tiny_LLaMA_1b_8k_intradm8 cc_8k cc_8k # intradocument masking + skyladder with alpha=1/8
```
Here, `dm8` means that $\alpha$ is 1/8: the local window size $w$ increases by 1 every 8 steps, so it takes 64k steps to reach the full 8k window. In our implementation, `dm1` is the fastest (8k steps to reach 8k) and `dm8` is the slowest (64k steps to reach 8k).
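
The step counts above follow from simple arithmetic, sketched below for the linear schedule. The function names are hypothetical, and two details are assumptions rather than facts about the implementation: the window is taken to start at 0 (`start_window`), and 8k is taken to mean 8192 tokens.

```python
def linear_window(step, rate, max_window=8192, start_window=0):
    """Window size after `step` steps when w grows by 1 every `rate` steps.

    `rate` is 1 / alpha, e.g. rate=8 for `dm8`. The starting window of 0 is
    an assumption made for illustration.
    """
    return min(max_window, start_window + step // rate)


def steps_to_reach(max_window, rate, start_window=0):
    """Number of steps until the window first reaches `max_window`."""
    return (max_window - start_window) * rate
```

With these assumptions, `steps_to_reach(8192, 8)` gives 65536 (the "64k steps" for `dm8`) and `steps_to_reach(8192, 1)` gives 8192 (the "8k steps" for `dm1`).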

On a node with 8 A100 (40G) GPUs, pretraining a 1B model with 8k context on 100B tokens takes around 10 days.
For faster results, consider using a smaller model. For instance, the 120M model (`tiny_LLaMA_120M_8k`) takes around 1 day to pretrain on 100B tokens.
Other pretraining settings (learning rate, batch size, max steps, etc.) can be changed in the [`pretrain/tinyllama.py`](pretrain/tinyllama.py) file.

### Advanced Usage

#### Multi-node Pretraining
To run on multiple nodes, use the following commands instead:
```bash
export WANDB_API_KEY=<YOUR_WANDB_API_KEY> # if you want to log into wandb
export BINS_ROOT=<YOUR_BIN_DIR> # from the previous data preparation step
export NUM_NODES=4 # number of nodes, adjust accordingly
bash scripts/pretraining_multi.sh tiny_LLaMA_1b_8k cc_8k cc_8k
```
This will run the pretraining on multiple nodes. 

#### Intra-Document Masking
We implemented intra-document masking, which can be combined with SkyLadder.
The model name `tiny_LLaMA_1b_8k_intramask` means that the model is trained with intra-document masking.
To combine it with SkyLadder, use suffixes such as `intradm8` ($\alpha=1/8$), `intradm4`, etc.

#### Other Types of Schedules
You can also find other types of schedules we experimented with in our paper.
For instance, `tiny_LLaMA_1b_8k_sin8` means that the schedule is a sinusoidal schedule with $\alpha$ being 1/8. 
We support linear (`dm8`), sinusoidal (`sin8`), and exponential (`exp8`) schedules. 
There are two modes we support, based on (1) the rate of increasing the context window or (2) the percentage of training tokens with an increasing context window. 
1. `{schedule-type}{rate}` where `rate` is $1/\alpha$. For instance, `sin8` means that the context window will increase by 1 every 8 steps.
2. `{schedule-type}{scheduling-percent}p` where `scheduling-percent` is the percentage of training tokens with an increasing context window, "climbing the ladder". For instance, `sin70p` means that 70% of the training tokens will have an increasing context window, following a sinusoidal schedule.
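
The naming convention above can be decoded mechanically. The sketch below is a hypothetical parser for the part of the config name after `tiny_LLaMA_1b_8k_`. It is not how the repository reads config names, and it assumes that the `intra` prefix and the `p` mode can be combined with any schedule type, which the text above only confirms for `intradm8` and `sin70p`.

```python
import re

# Schedule prefixes as described in this section.
SCHEDULE_TYPES = {"dm": "linear", "sin": "sinusoidal", "exp": "exponential"}
_PATTERN = re.compile(r"^(intra)?(dm|sin|exp)(\d+)(p?)$")


def parse_schedule_suffix(suffix):
    """Decode a suffix such as 'dm8', 'sin70p', 'intradm8', or 'intramask'."""
    if suffix == "intramask":
        # Intra-document masking without a context window schedule.
        return {"intra_doc_mask": True, "schedule": None}

    match = _PATTERN.match(suffix)
    if match is None:
        raise ValueError(f"unrecognized schedule suffix: {suffix!r}")

    intra, kind, number, percent = match.groups()
    result = {"intra_doc_mask": intra is not None, "schedule": SCHEDULE_TYPES[kind]}
    if percent:
        # Mode 2: percentage of training tokens spent increasing the window.
        result["scheduling_percent"] = int(number)
    else:
        # Mode 1: rate = 1 / alpha, i.e. steps per unit of window growth.
        result["rate"] = int(number)
    return result
```

For example, `parse_schedule_suffix("sin70p")` reports a sinusoidal schedule with `scheduling_percent` 70, while `parse_schedule_suffix("intradm8")` reports a linear schedule with `rate` 8 and intra-document masking enabled.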


