# Datasets

For Open-Sora 1.2, we conduct mixed training with both images and videos. The main datasets we use are listed below.
Please refer to [README](/README.md#data-processing) for data processing.

## Video

### Webvid-10M

[Webvid-10M](https://github.com/m-bain/webvid) contains 10 million video-text pairs scraped from stock footage sites.
We first train the model on this dataset (40k hours) for 30k steps (2 epochs).
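As a rough sanity check on these figures, the sketch below derives the average clip length and the average number of samples consumed per step. It assumes the rounded numbers quoted above (10M videos, 40k hours, 30k steps, 2 epochs) are exact; the function names are illustrative and not part of the Open-Sora codebase.

```python
def average_clip_seconds(total_hours: float, num_videos: int) -> float:
    """Average clip duration in seconds, given total footage and clip count."""
    return total_hours * 3600 / num_videos


def average_samples_per_step(num_videos: int, epochs: float, steps: int) -> float:
    """Average number of samples consumed per optimizer step."""
    return num_videos * epochs / steps


if __name__ == "__main__":
    # Rounded figures quoted in this section (assumed exact for the estimate).
    print(average_clip_seconds(40_000, 10_000_000))
    print(average_samples_per_step(10_000_000, 2, 30_000))
```

Under these assumptions, clips average about 14 seconds and each step consumes several hundred samples on average. The real per-step batch size may vary, so treat the second number only as an average.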

### Panda-70M

[Panda-70M](https://github.com/snap-research/Panda-70M) is a large-scale dataset with 70M video-caption pairs.
We use the [training-10M subset](https://github.com/snap-research/Panda-70M/tree/main/dataset_dataloading) for training,
which contains ~10M videos of better quality.

### Mixkit

[Mixkit](https://mixkit.co/) is a video website from which we obtained 9k videos.

### Pixabay

[Pixabay](https://pixabay.com/videos/) is a video website from which we obtained 60.5k videos.

### Pexels

[Pexels](https://www.pexels.com/) is a popular online platform that provides high-quality stock photos, videos, and music for free.
Most videos from this website are of high quality. Thus, we use them for both pre-training and HQ fine-tuning.
We really appreciate the great platform and the contributors!

### Inter4K

[Inter4K](https://github.com/alexandrosstergiou/Inter4K) is a dataset containing 1K video clips with 4K resolution.
The dataset was proposed for super-resolution tasks. We use it for HQ fine-tuning.

### HD-VG-130M

[HD-VG-130M](https://github.com/daooshee/HD-VG-130M?tab=readme-ov-file) comprises 130M text-video pairs.
The captions are generated by BLIP-2.
We find that the scene and text quality are relatively poor. For Open-Sora 1.0, we only use ~350K samples from this dataset.

### MiraData

[MiraData](https://github.com/mira-space/MiraData) is a high-quality dataset with 77k long videos, mainly from games and city/scenic exploration.


### Vript

[Vript](https://github.com/mutonix/Vript/tree/main) is a densely annotated dataset of 400k videos.


## Image

### Midjourney-v5-1.7M

[Midjourney-v5-1.7M](https://huggingface.co/datasets/wanng/midjourney-v5-202304-clean) includes 1.7M image-text pairs.
It provides two subsets: original and upscale.
The dataset was proposed for exploring the relationship between prompts and high-quality images.

### Midjourney-kaggle-clean

[Midjourney-kaggle-clean](https://huggingface.co/datasets/wanng/midjourney-kaggle-clean) is a reconstructed version of [Midjourney User Prompts & Generated Images (250k)](https://www.kaggle.com/datasets/succinctlyai/midjourney-texttoimage?select=general-01_2022_06_20.json%5D) that has been cleaned with rule-based filters.
It is also divided into two subsets: original and upscale.
The dataset was proposed to enable research on text-to-image model prompting.

### Unsplash-lite

The [Unsplash-lite](https://github.com/unsplash/datasets) dataset comprises 25k nature-themed Unsplash photos, 25k keywords, and 1M searches.
This dataset covers a vast range of uses and contexts. Its extensive scope in intent and semantics opens new avenues for research and learning.

### LAION-AESTHETICS 6.5+

The LAION aesthetic 6.5+ dataset is a subset of the LAION dataset containing 625K high-quality images with aesthetic scores > 6.5. However, as LAION is currently not publicly available, we use this 168k [subset](https://huggingface.co/datasets/bhargavsdesai/laion_improved_aesthetics_6.5plus_with_images).
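
To make the "aesthetic scores > 6.5" criterion concrete, here is a minimal sketch of a threshold filter over image metadata. The record layout and the `aesthetic_score` key are hypothetical and chosen only for illustration; this is not the pipeline used by LAION or by the linked subset.

```python
from typing import Dict, Iterable, List


def filter_by_aesthetic(
    records: Iterable[Dict], threshold: float = 6.5
) -> List[Dict]:
    """Keep records whose (hypothetical) 'aesthetic_score' exceeds the threshold.

    Records without a score are dropped rather than guessed at.
    """
    kept = []
    for record in records:
        score = record.get("aesthetic_score")
        if score is not None and score > threshold:
            kept.append(record)
    return kept


if __name__ == "__main__":
    # Toy metadata; URLs and scores are made up for the example.
    sample = [
        {"url": "https://example.com/a.jpg", "aesthetic_score": 6.9},
        {"url": "https://example.com/b.jpg", "aesthetic_score": 6.5},
        {"url": "https://example.com/c.jpg", "aesthetic_score": None},
    ]
    for record in filter_by_aesthetic(sample):
        print(record["url"])
```

The comparison is strict, so an image scoring exactly 6.5 is excluded, matching the "> 6.5" wording above.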
