# VLFeedback

A GPT-4V-annotated preference dataset for large vision-language models.

[[Project Page]](https://vlf-silkie.github.io) [[Datasets]](https://huggingface.co/datasets/MMInstruction/VLFeedback) [[Silkie Model]](https://huggingface.co/MMInstruction/Silkie) [[Paper]](https://arxiv.org/abs/2312.10665)

## Annotation Framework

<img src="imgs/annotate_framework.png" width="800px">


### Multimodal Instruction Source

The instructions are sampled from various domains to cover different capabilities of large vision-language models (LVLMs).


<img src="imgs/instruction_source.png" width="800px">


### Model Pool

We construct a model pool consisting of 12 LVLMs:

- GPT-4V
- LLaVA-series
  - LLaVA-v1.5-7B
  - LLaVA-v1.5-13B
  - LLaVA-RLHF-7b-v1.5-224
  - LLaVA-RLHF-13b-v1.5-336
- Qwen-VL-7B
- IDEFICS-9b-Instruct
- Fuyu-8B
- InstructBLIP-series
  - InstructBLIP-Vicuna-7B
  - InstructBLIP-Vicuna-13B
- VisualGLM-6B
- MMICL-Vicuna-13B
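
Assuming the responses decoded by models in this pool are what GPT-4V annotates, one way to turn scored responses into preference pairs for training is sketched below. This is an illustration only, not the dataset's actual construction procedure: the record keys (`model`, `text`, `score`), the `min_gap` parameter, and the toy responses are all hypothetical and do not reflect the real VLFeedback schema.

```python
from itertools import combinations


def build_preference_pairs(responses, min_gap=0.0):
    """Turn scored responses to one instruction into (chosen, rejected) pairs.

    `responses` is a list of dicts with the hypothetical keys
    "model", "text", and "score" (higher score = better response).
    """
    pairs = []
    for a, b in combinations(responses, 2):
        gap = a["score"] - b["score"]
        if abs(gap) <= min_gap:
            continue  # skip ties and near-ties: no clear preference
        chosen, rejected = (a, b) if gap > 0 else (b, a)
        pairs.append({"chosen": chosen["text"], "rejected": rejected["text"]})
    return pairs


# Toy data for illustration; the texts and scores are made up.
example = [
    {"model": "GPT-4V", "text": "a", "score": 5},
    {"model": "LLaVA-v1.5-7B", "text": "b", "score": 3},
    {"model": "Fuyu-8B", "text": "c", "score": 3},
]
pairs = build_preference_pairs(example)
```

Pairs whose scores tie are dropped here because they carry no preference signal; other strategies (for example, keeping only the best and worst response per instruction) are equally plausible.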



## Silkie

We select Qwen-VL-Chat as the backbone model and perform direct preference optimization (DPO) on our dataset.

<div align="center">
    <img src="imgs/silkie.png" alt="Silkie Logo" width="128px">
<p>Generated by <a href="https://openai.com/dall-e-3">DALL·E 3</a></p>
</div>

The resulting model, Silkie, achieves comprehensive improvements on various benchmarks.


<img src="imgs/silkie_ret.png" width="800px">
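
For readers unfamiliar with DPO, the per-example objective can be sketched in a few lines of plain Python. This is a minimal illustration of the standard DPO loss, not the code used by the training scripts in this repository: the function names and the default `beta=0.1` are assumptions, and the inputs are taken to be summed sequence log-probabilities of the chosen and rejected responses under the policy and a frozen reference model.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss from sequence log-probabilities.

    beta (assumed default 0.1) scales how strongly the policy is
    pushed away from the reference model.
    """
    # Implicit rewards: log-ratio of policy to reference for each response.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)
```

When the policy and reference agree, the margin is zero and the loss equals `log(2)`; the loss falls below that value as the policy assigns relatively more probability to the chosen response than the reference does.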

### Installation

To run our training scripts, create a virtual environment and install the dependencies first.

```bash
conda create -n silkie python=3.10 && conda activate silkie
pip install -r requirements.txt
```

### Training

Our training scripts support both single-node and multi-node training.
We provide a `launch_dpo.py` script that handles both cases. If you want to launch a job locally, you can use:

```bash
python launch_dpo.py --config dpo_config/example.yaml --working $WORKING_DIR
```

If you want to launch a job on a Slurm cluster, specify `GPUS_PER_NODE` in `launch_dpo.py` and run:

```bash
python launch_dpo.py --config dpo_config/example.yaml --working $WORKING_DIR --gpus $NUM_GPUS
```

## Citations

```bib
@article{2023vlfeedback,
  author      = {Lei Li and Zhihui Xie and Mukai Li and Shunian Chen and Peiyi Wang and Liang Chen and Yazheng Yang and Benyou Wang and Lingpeng Kong},
  title       = {Silkie: Preference Distillation for Large Visual Language Models},
  journal     = {arXiv preprint arXiv:2312.10665},
  year        = {2023}
}
```

## Acknowledgements

We would like to thank the authors of [trl](https://github.com/huggingface/trl) and [Qwen-VL](https://github.com/QwenLM/Qwen-VL) for their great work.