---
license: apache-2.0
task_categories:
- question-answering
- video-text-to-text
language:
- en
tags:
- GPT-4V
- video
size_categories:
- n < 1M
---

# ShareGPTVideo Training Data
All dataset and models can be found at [ShareGPTVideo](https://huggingface.co/ShareGPTVideo).

# Contents:

- [Train 300k video frames](https://huggingface.co/datasets/ShareGPTVideo/train_video_and_instruction/tree/main/train_300k): contains video frames used for SFT and DPO model, which is a subset of total 900k.

  ActivityNet 50k + vidal 150k + webvid 100k.

- [Train 600k video frames](https://huggingface.co/datasets/ShareGPTVideo/train_video_and_instruction/tree/main/train_600k): contains the rest 600k frames, the total 900k frames are used for pre-training stage. If you just do finetuning using our video QA, you can just download the 300k above.

  900k composition is 400k WebVid + 450k Vidal + 50k ActivityNet

- [Instruction for DPO](https://huggingface.co/datasets/ShareGPTVideo/train_video_and_instruction/tree/main/video_instruction/train/dpo): 17k video preference data.

  **Training pipeline** refer to [LLaVA-Hound-DPO training](https://github.com/RifleZhang/LLaVA-Hound-DPO/blob/main/llava_hound_dpo/dpo_scripts/README.md)

- [900k Detailed Video Caption](https://huggingface.co/datasets/ShareGPTVideo/train_video_and_instruction/blob/main/video_instruction/train/pretrain/video_caption_pretrain.jsonl): 900k video caption used for pretrain.
- [900k Video QA](https://huggingface.co/datasets/ShareGPTVideo/train_video_and_instruction/blob/main/video_instruction/train/qa/chatgpt_qa_900k.jsonl): For the 300k video frames above, we generate 3 qa pairs for each, in total 900k. We only used 240k subset for SFT.



# Set up:
```bash
git clone git@github.com:RifleZhang/LLaVA-Hound-DPO.git
# fill in requirement path and token at: https://github.com/RifleZhang/LLaVA-Hound-DPO/blob/main/setup/set_path.sh
source setup/setup_env.sh
source setup/setup_train_data.sh
```

**Video Frames**:
```bash
video_data
└── train
...
keep updating
```

**Train instruction**:
```bash
video_instruction
├── train
│   ├── dpo
...
keep updating
```

# Preview examples
<details><summary>DPO Data</summary>
  
  ```bash
{'id': '--HUaAQKqgA_2',
  'prompt': 'How does the background fabric enhance the decorative appeal of the mirror cover?',
  'answer': 'The background fabric on which the mirror cover is displayed is golden with a shiny, patterned texture, which enhances the overall decorative appeal of the cover.',
  'video': '--HUaAQKqgA',
  'chosen': 'The background fabric has a gold hue, which adds to the reflective shine of the various stones and pearls that constitute the mirror cover, making it look even more ornate and luxurious.',
  'rejected': 'The background fabric enhances the decorative appeal of the mirror cover by providing a luxurious golden setting for the mirror.',
  'chosen_score': 5.0,
  'rejected_score': 3.0
}
```
</details>

<details><summary>Detailed Caption Data</summary>
  
  ```bash
{
  'id': 'xWwFJDHQQE4',
  'video': 'xWwFJDHQQE4',
  'conversations':
    [
      {
        'from': 'human',
        'value': "Provide a comprehensive analysis of the video's content and themes.\n<video>"
      },
      {
        'from': 'gpt',
        'value': 'The video opens with water coming to a boil in a stainless-steel pot. The second frame presents a close-up of a yellow packet of instant noodles with red and green accents, displaying text and images indicative of a masala flavor.
                  The subsequent frame shows a hand holding a seasoning sachet, preparing to tear it open. Another frame captures the block of dry noodles being placed into the boiling water, followed by a scene where a fork is used to stir the noodles as they cook.
                  The noodles change in consistency as they absorb the water and seasoning, turning softer and taking on a yellowish hue. As the video progresses, the fully cooked noodles are shown being served in a silver pot and then transferred onto a floral-design plate.
                  The final frame displays two children seated on a brown striped sofa, each with a plate of the prepared noodles, looking happy.'
      }
    ]
}
```
</details>

<details><summary>Video QA data</summary>
  
  ```bash
{
  'id': '---uhMEpio8_0',
  'video': '---uhMEpio8',
  'conversations':
    [
      {
        'from': 'human',
        'value': '<video>\nHow would you describe the design of the architectural structure in the video?'
      },
      {
        'from': 'gpt',
         'value': 'The architectural structure in the video has a tilted and angular appearance that defies conventional building structures.'
      }
    ]
}
```
</details>


<br>

# Issues and Requests
send to [project page](https://github.com/RifleZhang/LLaVA-Hound-DPO/tree/main)

# Citation
```
@article{zhang2024direct,
  title={Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward},
  author={Zhang, Ruohong and Gui, Liangke and Sun, Zhiqing and Feng, Yihao and Xu, Keyang and Zhang, Yuanhan and Fu, Di and Li, Chunyuan and Hauptmann, Alexander and Bisk, Yonatan and others},
  journal={arXiv preprint arXiv:2404.01258},
  year={2024}
}
```