# Scalable Pretraining of Retrieval Models 


This repository contains code to replicate our research. It is a fork of the [litgpt codebase](https://github.com/Lightning-AI/litgpt), adapted for building and training a PSLM (Prefix-Suffix Language Model).

## Getting Started
1. Add the following to your `~/.*rc` script.
```
module load gitlfs
module load Python3/3.11.2
alias python="python3" # makes "python" point to the above
```
_Recommended_: comment out any other conda-related initialization block in your `~/.*rc` script. You don't need to delete anything else.

2. Clone the repo.
3. Run the installer:
   ```
   bash install_nvidia.sh
   ```
   This creates a conda environment at the location specified at the top of `install_nvidia.sh`. The script downloads the miniforge distribution of conda for this purpose, regardless of whether conda is already installed on the cluster.

## Pretraining (phase 1)

```
python train.py \
    --config=launch_configs/base_optim_longwu_highlr_cos.json \
    --run_name=test \
    --out_dir=/XXXX-36/XXXX-22/output \
    --keep_k_cross_device_negatives=368640 \
    --length_shortcut_ablation=truncate_lens_100_normal \
    --micro_batch_size=2 \
    --world_batch_size=8 \
    --negatives_cross_device_group_size=1 \
    --max_tokens=null \
    --max_steps=131900 \
    --warmup_steps=6000 \
    --optim_config.lr=2e-3 \
    --min_lr=2e-4 \
    --fabric_strategy=axonn_tp \
    --attn_impl=sdpa \
    --fabric.depth_tensor_parallel_size=1 \
    --save_n_min_before_job_done=5 \
    --wandb_tags='[prod,160m,v3,25_62_env]'
```

## Downloading and preparing models from HF for training in lit-gpt

All checkpoints from upstream Hugging Face models must be downloaded and run through a conversion utility before lit-gpt can load them. Hugging Face models can technically be loaded directly with `model_impl=huggingface`, but this does not scale and leads to unexpected errors for larger models.

This example uses git lfs to pull the whole model, but a Hugging Face `AutoModelForCausalLM.from_pretrained()` call followed by a `model.save_pretrained(path)` call should also work for the downloading step.

```
# confirm this prints affirmatively, else goto step 1. in XXXX-38 setup above
git lfs install
> Git LFS initialized.

# XXXX-38
cd /fs/XXXX-37/llm-pretraining/models/external

export MODEL_REPO=TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T && \
git clone https://<USER>:<TOKEN>@huggingface.co/$MODEL_REPO

cd XXXX-40

# Conversion to lit format.
# eg. if working on XXXX-38
python scripts/convert_hf_checkpoint.py \
   --checkpoint_dir /fs/XXXX-37/llm-pretraining/models/external/TinyLlama-1.1B-intermediate-step-1431k-3T

# if you get a no .bin files error, use --from_safetensors true
```
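
As an alternative to the git lfs clone above, the `from_pretrained` / `save_pretrained` route could look roughly like the following. This is a minimal sketch, not part of the repository: the helper name `download_hf_model` and the choice to also save the tokenizer next to the weights are assumptions made for illustration.

```python
from pathlib import Path


def download_hf_model(repo_id: str, target_dir: str) -> Path:
    """Download a causal LM and its tokenizer from the Hub and save them locally.

    Hypothetical helper for illustration; not part of this repository.
    """
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Mirror the layout of a git clone: <target_dir>/<model name>
    out = Path(target_dir) / repo_id.split("/")[-1]
    out.mkdir(parents=True, exist_ok=True)

    model = AutoModelForCausalLM.from_pretrained(repo_id)
    model.save_pretrained(out)

    # Assumption: the conversion step also wants the tokenizer files in the same folder.
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    tokenizer.save_pretrained(out)
    return out
```

Calling `download_hf_model("TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T", "/path/to/models/external")` would then give a folder that can be passed as `--checkpoint_dir` to the conversion script. Depending on the installed `transformers` version, the weights may be written as `.safetensors` rather than `.bin` files, in which case the `--from_safetensors true` hint above applies.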

## Preparing data for training (`PackedDataset`)

For datasets that can easily be loaded with Hugging Face's `load_dataset` function, we provide a simple
utility, `scripts/prepare_hf.py`, that tokenizes a dataset into the scalable `PackedDataset` format that our
training code expects for large training runs.

The recommended procedure is to first get the dataset into the non-cached, on-disk format that Hugging Face provides, which is the result of
`Dataset.save_to_disk`. Then point the utility at it; it calls `load_from_disk` under the hood.
Alternatively, the utility can call `load_dataset` instead, controlled via flags (see below).

Note: for very large datasets that you don't want to fully download, see `notebooks/stream_sample.ipynb` for an example of pulling a sample of an hf dataset using the streaming argument to `load_dataset`.
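
The streaming approach mentioned in the note can be sketched as follows. This is not the content of the notebook, only an illustration under stated assumptions: the helper name `save_streamed_sample` and its parameters are hypothetical, and the sample is simply the first `n_rows` rows of the stream.

```python
from itertools import islice


def save_streamed_sample(dataset_name: str, n_rows: int, out_dir: str, split: str = "train") -> int:
    """Stream the first n_rows of a Hub dataset and save them in save_to_disk format.

    Hypothetical helper for illustration; not part of this repository.
    """
    # Imported lazily so this module can be imported without datasets installed.
    from datasets import Dataset, load_dataset

    # streaming=True returns an iterable dataset; nothing is fully downloaded up front.
    stream = load_dataset(dataset_name, split=split, streaming=True)

    # Materialize only the rows we want, then write the on-disk format
    # that load_from_disk can read back.
    rows = list(islice(stream, n_rows))
    Dataset.from_list(rows).save_to_disk(out_dir)
    return len(rows)
```

The resulting `out_dir` can then be used as `--dataset_name_or_path` together with `--ld_from_disk True`.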

```
# --ld_from_disk False means call load_dataset rather than load_from_disk
python scripts/prepare_hf.py \
--dataset_name_or_path /path/to/my/dataset/saved/to/disk \
--ld_from_disk True \
--add_bos False \
--add_eos True \
--checkpoint_dir /path/to/tokenizer \
--destination_path /path/to/my/packed/dataset

# see below to understand the important extra 
# arguments `chunk_size` and `num_shards` not shown here.
```

### On separator tokens and the `PackedDataset`'s `chunk_size` and `num_shards` parameters

A known downside of the `PackedDataset` implementation we took from upstream litgpt, present from day one,
is that writing fixed-size arrays to disk (the same number of tokens in every chunk) causes an issue
where, for some chunks/files, the tail end of the chunk is padded out with separators.

A tradeoff therefore has to be made when a worker exhausts its rows at the end of the subset of the data it is tokenizing. The `skip_remainder` flag of the `prepare_hf.py` script switches between:
- `False`: trailing EOS tokens pad out the end of a file when the process runs out of tokens
- `True`: zero padding is achieved by simply skipping the write of the current chunk if it is not completely full
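
The two settings above can be illustrated with a toy model of a single worker. This is a sketch under the assumption that each worker concatenates its documents into one token stream and cuts it into fixed-size chunks. `pack_shard` is a hypothetical function, not the actual writer used by `prepare_hf.py`.

```python
def pack_shard(token_counts, chunk_size, skip_remainder):
    """Toy model of one worker writing fixed-size chunks.

    token_counts: number of tokens in each document handled by this worker.
    Returns (chunks_written, pad_tokens, dropped_tokens).
    """
    total = sum(token_counts)
    full_chunks, remainder = divmod(total, chunk_size)
    if remainder == 0:
        # The stream fills the last chunk exactly: nothing padded, nothing dropped.
        return full_chunks, 0, 0
    if skip_remainder:
        # The partially filled last chunk is never written, so its tokens are lost.
        return full_chunks, 0, remainder
    # The partially filled last chunk is written and padded with separators.
    return full_chunks + 1, chunk_size - remainder, 0


docs = [700, 900, 650]  # 2250 tokens in total
print(pack_shard(docs, chunk_size=1000, skip_remainder=False))  # (3, 750, 0)
print(pack_shard(docs, chunk_size=1000, skip_remainder=True))   # (2, 0, 250)
```

In both cases the overhead for one worker is strictly less than `chunk_size` tokens.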

The bound on the number of padding separators (or, equivalently, the number of omitted tokens)
in the final packed dataset is the product of 1) how many workers/partitions of the data are being handled
and 2) how large the files/chunks being written are (they must all be the same size).

In the worst case, the number of possible pads (or drops) is `num_shards * chunk_size`, since there is one "last chunk" per shard during the parallelized tokenization process.
The pad overhead/drop amount can therefore be lessened by running with either a smaller `num_shards` or a smaller `chunk_size`.
However, the former makes the whole process slower because there is less parallelism, and the latter generates many more output files, because smaller chunks/files are needed to hold the same total number of tokens.
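
To get a feel for this tradeoff, the bound and the approximate file count can be computed directly. The numbers below are made up for illustration, and the function names are hypothetical rather than taken from the repository.

```python
def worst_case_overhead(num_shards: int, chunk_size: int) -> int:
    """Upper bound on padded (or dropped) tokens: one partial chunk per shard."""
    return num_shards * chunk_size


def approx_num_files(total_tokens: int, chunk_size: int) -> int:
    """Approximate number of output files, ignoring per-shard rounding."""
    return -(-total_tokens // chunk_size)  # ceiling division


total_tokens = 10_000_000_000  # illustrative dataset size
num_shards = 64                # illustrative number of parallel workers

for chunk_size in (2**24, 2**20):
    bound = worst_case_overhead(num_shards, chunk_size)
    files = approx_num_files(total_tokens, chunk_size)
    print(f"chunk_size={chunk_size}: bound={bound} tokens "
          f"({bound / total_tokens:.2%} of the data), ~{files} files")
```

With these example values, shrinking `chunk_size` by a factor of 16 shrinks the worst-case overhead by the same factor while multiplying the number of output files by roughly 16.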

Overall, we hope/think this has a reasonably small impact on performance, though we have not proven this conclusively.
Potential solutions include issue #30 or migrating to upstream litgpt's newer `StreamingDataset` implementation, which we hope doesn't have this issue.


### Managing configuration files

You can copy one of the quickstart files and extend it to manage your configuration. These configurations do not have to be exhaustive; the launched job writes its complete configuration into the output folder.

### Folder structure for launcher-managed jobs
The launcher creates the following layout. In the `output` folder of your current directory, a folder named after your `run_name` is created. If you set `--uuid`, the run name combines your original run name with a unique ID, which guarantees that each run has a unique folder. In either case, this folder contains the sbatch files and a logs folder. All SLURM launches associated with your launch (there may be multiple) log to separate files in the logs folder. All other artefacts are later stored in the run folder, too.


## Misc

### A note on iterations and batch sizes

The current version of the training template is governed by setting the XXXX-13 tokens you want to train on, the _global_ number of tokens you want to contribute to each optimization step, and the micro batch size you want to pass through a single model forward call for each model copy.

So, when referring to overall batch sizes, we typically count tokens (micro batch size is the exception). For example, the current tutorial script is configured for a batch size of ~4M tokens. This number follows from a context length/block size of 2048 and a world batch size of 2048 (2048 × 2048 = 4,194,304). Each node's batch size is derived from the global batch size and the number of nodes in the script.
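
The arithmetic behind the ~4M figure, and one plausible way the derived quantities relate, can be sketched as follows. The gradient-accumulation formula assumes plain data parallelism with one model copy per device, which is an assumption made for illustration. The training code's actual derivation may differ, so check the logged hparams.

```python
def tokens_per_step(block_size: int, world_batch_size: int) -> int:
    """Global number of tokens contributing to one optimizer step."""
    return block_size * world_batch_size


def grad_accum_steps(world_batch_size: int, micro_batch_size: int, num_devices: int) -> int:
    """Micro-batches each model copy processes per optimizer step.

    Assumes pure data parallelism (one model copy per device); hypothetical helper.
    """
    per_device, rem = divmod(world_batch_size, num_devices)
    if rem:
        raise ValueError("world_batch_size must be divisible by num_devices")
    steps, rem = divmod(per_device, micro_batch_size)
    if rem:
        raise ValueError("per-device batch size must be divisible by micro_batch_size")
    return steps


print(tokens_per_step(block_size=2048, world_batch_size=2048))  # 4194304
print(grad_accum_steps(world_batch_size=2048, micro_batch_size=4, num_devices=64))  # 8
```

The device count and micro batch size in the second call are illustrative values, not the tutorial script's settings.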

The code logs all hparams, including the user-provided batch size and token arguments as well as those derived from them. It is always worth sanity-checking that your batch logic makes sense before launching any full-length run.


### Simple debug runs
Use 
```bash
python train.py --config=launch_configs/quickstart_laptop.json
```
to run a simple debug setup for a few steps. This also works if SLURM is not loaded, and without a GPU. The debug model and debug data are tiny and should run on any laptop.

### Multi-node training

The training script uses lightning Fabric to handle multi-node training - see [this tutorial](https://lightning.ai/docs/fabric/stable/guide/multi_node/slurm.html).

The key cluster-dependent modifications are the sbatch parameters (`gres`, `nodes`, `ntasks-per-node`), which must be set according to what the cluster allows.

### Important things relating to multi-node
- The dataloading process splits the full training data uniformly across the available nodes and cards according to a combination of the user-passed random seed and the rank ids of all the workers. As a result, **once you start your run you must keep the same number of nodes** (when resuming from checkpoints, you cannot use a different allocation unless you switch data sources).
- For Tioga, you might need to use `flux batch -N 2 -n 16 sbatch_file.sh`
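
Why the allocation must stay fixed can be seen in a toy version of seed-and-rank-based sharding. This is a simplified sketch, not the repository's dataloader: `assign_files`, the file names, and the strided assignment scheme are all assumptions made for illustration.

```python
import random


def assign_files(filenames, seed, rank, world_size):
    """Deterministically shuffle the file list, then give each rank a strided slice."""
    files = sorted(filenames)
    random.Random(seed).shuffle(files)  # same seed -> same order on every rank
    return files[rank::world_size]


files = [f"chunk_{i:03d}.bin" for i in range(16)]

# Same seed and same world size: rank 0 always sees the same files.
print(assign_files(files, seed=1337, rank=0, world_size=4))

# Same seed, different world size: rank 0 now owns a different subset, so a
# resumed run would revisit some data and never reach other parts of it.
print(assign_files(files, seed=1337, rank=0, world_size=8))
```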

### SLURM tricks

* Run `scancel job_id` immediately, if you think you messed up, or even `scancel --all`!
* Run `sacct -j jobid -o jobid%20,jobname%20,alloccpus%20,Start%20,elapsed%20,state,exitcode` to get an overview of job steps and timings

### Wandb logging
Wandb logs locally. However, we can push the progress to https://wandb.ai at any time with [wandb sync](https://docs.wandb.ai/ref/cli/wandb-sync), and we can effectively log in real time with something like `watch -n 120 'wandb sync /output/quick_run_**/wandb/offline-run-**'`. For more info see issue XXXX-6.

### Sync with the official lit-gpt repo

This is a private clone of the lit-gpt repo, so it doesn't directly work like a "fork".  
It's recommended to sync with the changes from the official repo every now and then. 
To do that, you'd need to add the official repo as an upstream locally: 

```bash
git remote add upstream https://github.com/Lightning-AI/lit-gpt.git
git remote set-url --push upstream DISABLE
```

When you push, do so on `origin` with `git push origin`.  
When you want to pull changes from `upstream`, fetch the remote and pull or merge on top of your work (rather than rebasing).
This creates a merge commit, which some people don't like, but it is generally safer and preserves all history.
```bash
git fetch upstream

# git rebase upstream/main

git pull upstream main
# or
git pull upstream main --no-commit --no-ff
git commit
```
Then resolve any conflicts.
Optionally, you can use the `--no-commit --no-ff` ("no commit and no fast-forward") flags to dry-run the operation: the merge is performed but the result is not committed, so the full list of changes can be inspected first, for example in VS Code.
(Credit: [this tutorial](https://gist.github.com/0xjac/85097472043b697ab57ba1b1c7530274) on how to make a private fork.)
