## 🛠️ Quick Start

### Installation

- We recommend creating a Python 3.10 virtual environment with conda

  ```bash
  conda create --name mgllava-env python=3.10 -y
  conda activate mgllava-env
  ```

- Install XTuner from source

  ```shell
  cd MG-LLaVA
  pip install -e '.[all]'
  ```

### Data Preparation

Please refer to [dataset_prepare.md](dataset_prepare.md).

### Before Train
MG-LLaVA employs several LLMs ranging from 3.8B to 34B parameters, including [Phi-3-3.8B](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct), [Vicuna1.5-7B](https://huggingface.co/lmsys/vicuna-7b-v1.5), [Vicuna1.5-13B](https://huggingface.co/lmsys/vicuna-13b-v1.5), [llama3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct), and [Yi1.5-34B](https://huggingface.co/01-ai/Yi-1.5-34B-Chat). We use [CLIP-Large-336](https://huggingface.co/openai/clip-vit-large-patch14-336) and [CLIP-ConvNext-320-d](https://huggingface.co/laion/CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft-soup) as vision encoders. Download both the LLM and the CLIP checkpoints before training.
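
As a rough sketch of the download step, the snippet below prints `huggingface-cli download` commands for the Vicuna1.5-7B LLM and the two vision encoders, using the repository IDs from the links above. The `CKPT_DIR` variable, the `DRY_RUN` switch, the `download_checkpoints` function name, and the target directory layout are assumptions made for illustration, and the snippet assumes the `huggingface_hub` command-line tool is installed. Other LLMs from the list can be substituted; some (such as Llama 3) may require accepting a license on the Hub first.

```shell
# Sketch only: CKPT_DIR and the directory layout are assumptions,
# not something MG-LLaVA requires.
CKPT_DIR="${CKPT_DIR:-./checkpoints}"

# Print (or run) one download command per checkpoint.
# Repository IDs are taken from the Hugging Face links above.
download_checkpoints() {
  for repo in \
      "lmsys/vicuna-7b-v1.5" \
      "openai/clip-vit-large-patch14-336" \
      "laion/CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft-soup"; do
    # Store each checkpoint under CKPT_DIR/<repository name>.
    target="${CKPT_DIR}/${repo##*/}"
    if [ "${DRY_RUN:-1}" = "1" ]; then
      # Default: only show what would be executed.
      echo "huggingface-cli download ${repo} --local-dir ${target}"
    else
      huggingface-cli download "${repo}" --local-dir "${target}"
    fi
  done
}

download_checkpoints
```

Run it with `DRY_RUN=0` to perform the downloads, then point `llm_name_or_path`, `visual_encoder_name_or_path`, and `visual_encoder_aux_path` in the config at the resulting directories.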

Training follows the same workflow as the original XTuner. Before training, check the [configs](mg_llava/config) and set the following variables to match your environment. You can also edit other options in the configs to customize training.
  ```shell
  # Path of LLM and CLIP
  llm_name_or_path
  visual_encoder_name_or_path
  visual_encoder_aux_path
  prompt_template

  # Data
  data_path
  box_json_path
  image_folder
  offline_processed_text_folder  # optional

  # Training
  pretrained_pth  # fine-tuning stage
  ```
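
Because a wrong path typically surfaces only after the job has started, it can help to verify the local paths first. The helper below is a minimal sketch: `check_paths` is a hypothetical function, not part of XTuner or MG-LLaVA, and it only applies to values that are local paths (skip any variable you set to a Hugging Face Hub model ID).

```shell
# Hypothetical pre-flight helper: report every argument that does not
# exist on disk and return non-zero if any is missing.
check_paths() {
  missing=0
  for p in "$@"; do
    if [ ! -e "$p" ]; then
      echo "missing: $p" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage (placeholders; substitute the values from your config):
# check_paths 'PATH TO LLM' 'PATH TO CLIP MODEL' 'PATH TO ConvNext MODEL' || exit 1
```
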
Optionally, preprocess the text data offline before training to speed up the training process:

```shell
python xtuner/tools/process_untokenized_llava_data.py CONFIG --save-folder TEXT-PATH
```
Then set `offline_processed_text_folder` in the config file to `TEXT-PATH`.

### Train & Evaluation
MG-LLaVA follows a two-stage training process: pretraining, then fine-tuning. The entire process takes approximately 23 hours for the Vicuna1.5-7B model on 8×A100 GPUs. For example, to train MG-LLaVA with Vicuna1.5-7B, use the following command:


- **Entire Pipeline**: Pretraining + Fine-tuning + Evaluation

  ```shell
  bash script/train_vicuna7B.sh
  ```

If you want to train our model step by step, you can follow the instructions below:

- **Step 1**, start pretraining.
  ```shell
  bash script/train_pretrain.sh mg_llava/config/vicuna/fuse_vicuna7b_clip_L_14_336_pretrain_padding.py
  ```

- **Step 2**, start fine-tuning.

  ```shell
  bash script/train_sft.sh mg_llava/config/vicuna/fuse_vicuna7b_clip_L_14_336_sft_padding.py
  ```

  - The `--deepspeed` argument enables [DeepSpeed](https://github.com/microsoft/DeepSpeed) 🚀 to optimize training. XTuner integrates several strategies, including ZeRO-1, ZeRO-2, and ZeRO-3. To disable this feature, remove the argument.

  - For more examples, please see [finetune.md](./docs/en/user_guides/finetune.md).

- **Step 3**, evaluation. The evaluation benchmarks are specified in the SFT config and include MMBench, SEED, SQA, AI2D, TextVQA, POPE, GQA, VQAv2, and others. Please refer to [evaluation.md](evaluation.md) for details.

  You can convert the saved PTH model (a directory if DeepSpeed was used) to a Hugging Face model with:

  ```shell
  xtuner convert pth_to_hf CONFIG_NAME_OR_PATH CHECKPOINT SAVE_PATH
  ```
<a id="inference"></a>
### Inference
Before inference, prepare the MG-LLaVA checkpoint and the corresponding LLM. In addition, [CLIP-Large-336](https://huggingface.co/openai/clip-vit-large-patch14-336), [CLIP-ConvNext-320-d](https://huggingface.co/laion/CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft-soup), [RAM](https://huggingface.co/xinyu1205/recognize-anything-plus-model/blob/main/ram_plus_swin_large_14m.pth) and [OWL-VIT-2](https://huggingface.co/google/owlv2-large-patch14-ensemble) are also required.


The inference code is in [chat.py](mg_llava/module/chat.py). The following command, as provided in [chat.sh](script/chat.sh), starts a chat session with MG-LLaVA. The `srun` options target a Slurm cluster; replace the partition name `mllm_1` with your own, or drop the first two lines to run `python` directly on a local GPU.

```shell
srun -p mllm_1 \
    --gres=gpu:1 \
    python mg_llava/module/chat.py \
    'PATH TO MG-LLaVA-Vicuna-7B MODEL' \
    --llm_name_or_path 'PATH TO Vicuna1.5-7B LLM' \
    --visual_encoder_clip 'PATH TO CLIP MODEL' \
    --visual_encoder_convnext 'PATH TO ConvNext MODEL' \
    --ram_model 'PATH TO RAM MODEL' \
    --owl_vit_model 'PATH TO OWL-VIT-2 MODEL' \
    --prompt-template 'vicuna' \
    --image examples/example.jpg
```

