# Lumina-mGPT Training

For efficiency considerations, the multi-modal datasets are pre-tokenized into sequences of token ids. This leads to significantly faster training

## Pre-tokenization


### 1. Run Tokenization

This stage tokenizes each data point, consisting of interleaved image and text, into a single sequence of integer tokens. After tokenization, the sequence is saved to disk for trainining-time usage. Together with the saved tokens, a json-formatted record file is also generated for indexing all the saved token files. For faster tokenization, you may use multiple GPUs and dispatch different subsets of data to them.

#### Command:

```bash
for i in {0..7}
do
  export CUDA_VISIBLE_DEVICES=${i}
  python -u pre_tokenize/pre_tokenize.py \
  --splits=8 \
  --rank=${i} \
  --in_filename /path/to/in_filename.json \
  --out_dir /path/to/out_dir \
  --target_size 768 &> ${i}.log &
done
```

#### Format of Input File:

`in_filename` is expected to be a json file with the following format:
```python
[
    {...},
    {...},
    {
        "conversations":[
            {
                "from": "human",
                "value": "Hi, please convert this depth image <|image|> to a color image"
            },
            {
                "from": "gpt",
                "value": "<|image|>"
            },
            {
                "from": "human",
                "value": "Could you change its style to that of this image? <|image|>"
            },
            {
                "from": "gpt",
                "value": "Sure, here's the image. <|image|>"
            }
        ],
        "image": ["/path/to/image1.png", "path/to/image2.png", "path/to/image3.png", "path/to/image4.png"]
    },
    {...},
    {...}
]
```

*Rules:*

1. The file is a list of dictionaries, and each dictionary represents a data point
2. Each dictionary contains the key "conversations"
3. If the conversation involves image(s), the data point should also contain the `image` key, otherwise the `image` key can be omitted
4. The location of each image should be explicitly specified in the conversation using the `<|image|>` symbol
   1. Apparently, the number of occurrences of the `<|image|>` symbol should be equal to the number of images in the `image` key


#### How to adapt to your own format:

If you have your own data with a different format, you can easily adapt the code to deal with it by modifying the `pre_tokenize.py` file.
We have prepared the space, which is in `ItemProcessor.process_item`, for adding your logic that converts data points of your own format into the standard format.

### 2. Concat Records

After tokenization, You need to concat the record files generated by different processes (GPUs) into one single record file.
**Note that we use the term "record file" to refer to the meta file that contains the information of all the saved token files,
which is different from the token files themselves.**

```bash
python -u ./lumina_mgpt/pre_tokenize/concat_record.py --sub_record_dir ./data/civitai/mgpt_format --save_path ./data/civitai/mgpt_format/record.json
```

## Training

#### Command:
We provide an example experiment scripts [exps/7B.sh](exps/7B.sh) for training the 7B model. Suppose you have access to a SLURM clsuter, you can run the following command to start training:

```bash
srun -n32 --ntasks-per-node=8 --gres=gpu:8 bash exps/7B.sh
```

Otherwise, if your want to use `torchrun` for distributed training, you can make the following change to `exps/7B.sh`:
```bash
# python -u finetune_solver.py \
torchrun --torchrun_kwargs finetune_solver.py \
```

#### About the `--data_config` argument:
The ``--data_config`` argument should point to a `*.yaml` file, which is a meta file that gathers one or multiple record files.
In other words, you may pre-tokenize multiple datasets independently and list the record files in the same data config file
for joint training.



torchrun --standalone --nnodes=1 --nproc-per-node=8 ./lumina_mgpt/finetune_solver.py  --model_size 7B --batch_size 8 --accum_iter 1 --epochs 2 --warmup_epochs 0.01 --lr 2e-5 --min_lr 2e-5 --wd 0.1 --clip_grad 4 --data_config "./lumina_mgpt/configs/data/ft_civi.yaml" --cache_ann_on_disk --num_workers 8 --output_dir "./workdir/7B-civi-1m" --save_iteration_interval 1000 --checkpointing --max_seq_len 4096 --unmask_image_logits --dropout 0.05 --z_loss_weight 1e-5  2>&1 | tee -a "./workdir/7B-civi-1m/output.log"