Abstract: The increasing availability of image-text pairs has largely fueled the rapid advancement of vision-language foundation models. However, the vast scale of these datasets inevitably introduces significant variability in data quality, which can adversely affect model performance. This highlights the critical role of data filtering, not only to enhance training efficiency but also to improve overall data quality. Existing methods typically rely on metrics such as CLIP Score and BLIP Score, which are derived from pre-trained models. However, these models are often trained on uncurated, noisy datasets, which can perpetuate errors and misalignments in the filtered dataset. We present a novel algorithm that incorporates human knowledge of image-text alignment to guide the filtering of vast corpora of web-crawled image-text data into a compact and high-quality form. To systematically capture human preferences on image-text alignment, we collect a diverse image-text dataset in which each image is associated with multiple captions from various sources, and establish a comprehensive set of both subjective and objective criteria to guide labelers' alignment assessments. Additionally, we train a reward model on these human-preference annotations to internalize the nuanced human understanding of image-text alignment. The resulting reward model can thus act as a human-like referee for filtering image-text pairs. Extensive experiments demonstrate that we can maintain, and sometimes even improve, model performance while compressing the image-text datasets by up to ∼90%. An impressive example is that, by aggressively reducing the total number of training samples from 130M to only 15.5M, our BLIP-B/16 models consistently show an average improvement of 2.9% on retrieval tasks and 11.5% on captioning tasks compared to their full-size-dataset counterparts.
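To make the filtering step concrete, the sketch below shows one plausible way a learned reward model could be applied to prune an image-text corpus: score every pair and keep only the top-scoring fraction. The `ImageTextPair` type, the `reward_fn` interface, and the `keep_fraction` value are illustrative assumptions, not the paper's actual implementation or thresholds.

```python
# Minimal sketch of reward-model-based filtering of image-text pairs.
# All names and values here are hypothetical, for illustration only.
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ImageTextPair:
    image_path: str
    caption: str


def filter_pairs(
    pairs: List[ImageTextPair],
    reward_fn: Callable[[ImageTextPair], float],  # stand-in for the trained reward model
    keep_fraction: float = 0.12,                  # roughly 15.5M / 130M, per the abstract
) -> List[ImageTextPair]:
    """Score every pair with the reward model and keep the top-scoring fraction."""
    scored: List[Tuple[float, ImageTextPair]] = [(reward_fn(p), p) for p in pairs]
    scored.sort(key=lambda item: item[0], reverse=True)
    keep_n = max(1, int(len(scored) * keep_fraction))
    return [pair for _, pair in scored[:keep_n]]


if __name__ == "__main__":
    # Toy usage with a stand-in reward function (here, caption length).
    toy = [
        ImageTextPair("img0.jpg", "a dog"),
        ImageTextPair("img1.jpg", "a brown dog running across a grassy field"),
    ]
    kept = filter_pairs(toy, reward_fn=lambda p: float(len(p.caption)), keep_fraction=0.5)
    print([p.image_path for p in kept])
```

In practice the ranking-then-truncation shown here could equally be replaced by an absolute score threshold; the abstract does not specify which selection rule is used.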
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Xuming_He3
Submission Number: 3236