TL;DR: We propose Recap-DataComp-1B, the first publicly available image-text dataset with LLaMA-3-generated synthetic captions at the billion scale.
Abstract: Web-crawled image-text pairs are inherently noisy. Prior studies demonstrate that semantically aligning and enriching the textual descriptions of these pairs can significantly enhance model training across various vision-language tasks, particularly text-to-image generation. However, large-scale investigations in this area remain predominantly closed-source. Our paper aims to bridge this gap for the community, leveraging the powerful and $\textit{open-sourced}$ LLaMA-3, a GPT-4-level LLM. Our recaptioning pipeline is simple: we first fine-tune a LLaMA-3-8B-powered LLaVA-1.5 and then employ it to recaption ~1.3 billion images from the DataComp-1B dataset. Our empirical results confirm that this enhanced dataset, Recap-DataComp-1B, offers substantial benefits in training advanced vision-language models. For discriminative models like CLIP, we observe an average 3.1% improvement in zero-shot performance across four cross-modal retrieval tasks when training on a mixed set of the original and our recaptioned text. For generative models like text-to-image Diffusion Transformers, the generated images show significantly better alignment with users' text instructions, especially for complex queries. Our project page is https://www.haqtu.me/Recap-Datacomp-1B/.
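To make the recaptioning step concrete, below is a minimal sketch of running a LLaVA-style captioner with Hugging Face transformers. The model id, prompt template, and generation settings are placeholders assumed for illustration, not the exact released checkpoint or configuration.

```python
# Minimal sketch of the recaptioning step: caption one image with a LLaVA-style model.
# MODEL_ID and the prompt template below are assumptions for illustration; they are not
# necessarily the released fine-tuned LLaVA-1.5 (LLaMA-3-8B backbone) or its chat format.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

MODEL_ID = "path/to/llava-llama3-8b-recaptioner"  # placeholder checkpoint id

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

def recaption(image_path: str) -> str:
    """Generate a detailed synthetic caption for a single image."""
    image = Image.open(image_path).convert("RGB")
    prompt = "USER: <image>\nPlease generate a detailed caption of this image. ASSISTANT:"
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(
        model.device, torch.float16
    )
    output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    text = processor.batch_decode(output_ids, skip_special_tokens=True)[0]
    # Keep only the assistant's answer as the new caption.
    return text.split("ASSISTANT:")[-1].strip()

print(recaption("example.jpg"))
```

In practice, the same call would be batched and sharded across GPUs to cover the ~1.3 billion DataComp-1B images.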
Lay Summary: Teaching machines to understand images usually requires a lot of text describing what is in those images: captions. But collecting billions of content-rich image-caption pairs to train models is expensive and time-consuming. To address this, we created Recap-DataComp-1B, the first publicly available dataset with one billion synthetic captions generated by a powerful large language model, LLaMA-3.
While generating captions isn’t new, doing it at this scale is. This massive dataset helps researchers train and evaluate systems that connect images and language, like those that generate pictures from text or find images that match a written description. Our experiments show that models trained on Recap-DataComp-1B perform better at understanding long and complex image-text relationships.
By releasing this dataset to the public, we hope to accelerate progress in multimodal systems that learn from both images and text — and make cutting-edge tools more accessible to everyone. We believe this work sets a new standard for building large-scale, high-quality datasets with synthetic data.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/UCSC-VLAA/Recap-DataComp-1B
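For a quick look at the released captions, the sketch below streams a few records with the Hugging Face datasets library. The dataset id and column names are assumed from the project release and may differ from the actual schema.

```python
# Minimal sketch: stream a handful of examples from the released captions without
# downloading the full billion-scale dataset. The dataset id "UCSC-VLAA/Recap-DataComp-1B"
# and the column names ("url", "re_caption", "org_caption") are assumptions.
from datasets import load_dataset

ds = load_dataset("UCSC-VLAA/Recap-DataComp-1B", split="train", streaming=True)

for i, example in enumerate(ds):
    # Print the image URL alongside the original and recaptioned text, if present.
    print({k: example.get(k) for k in ("url", "re_caption", "org_caption")})
    if i == 2:
        break
```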
Primary Area: Deep Learning->Foundation Models
Keywords: image-text datasets; synthetic captions
Submission Number: 13249