OVid: Open Large-Scale Video Dataset as a Novel Source for Image-Text Data

ICLR 2026 Conference Submission 20250 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · Readers: Everyone · CC BY 4.0
Keywords: video dataset, recaptioning, CLIP, open foundation models, open datasets
Abstract: We present OVid, a large open video dataset comprising _10 million hours_ of diverse content collected from CommonCrawl. To complement the raw data, we generate image captions for scene-changing frames and video-level captions, yielding a 300M frame–caption subset. Using this subset, we train CLIP models at multiple scales and benchmark them against reference CLIP models trained on DataComp, Re-LAION, and DataComp recaptioned with the same captioning pipeline. Observed scaling trends on classification and retrieval benchmarks provide evidence that OVid is another valuable and scalable source of image-text data, complementing image-text pairs from public webpages. OVid marks a significant step toward democratizing access to large-scale video data and fostering the development of open multimodal foundation models. To that end, all the data will be made freely available to research institutions.
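The abstract mentions selecting scene-changing frames as caption targets. The sketch below shows one plausible way to pick such frames via an HSV histogram-difference heuristic with OpenCV; the threshold, histogram settings, and OpenCV-based approach are assumptions for illustration, not the authors' actual pipeline.

```python
# Minimal sketch: select candidate frames at scene cuts for downstream captioning.
# Heuristic and parameters are assumptions, not the paper's method.
import cv2


def scene_change_frames(video_path: str, threshold: float = 0.5):
    """Yield (frame_index, frame) pairs where the HSV color histogram
    correlates poorly with the previous frame's, a common scene-cut heuristic."""
    cap = cv2.VideoCapture(video_path)
    prev_hist = None
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [50, 60], [0, 180, 0, 256])
        cv2.normalize(hist, hist)
        # Low correlation with the previous histogram suggests a scene change.
        if prev_hist is None or cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL) < threshold:
            yield idx, frame  # candidate frame to send to the captioner
        prev_hist = hist
        idx += 1
    cap.release()
```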
Primary Area: datasets and benchmarks
Submission Number: 20250