TL;DR: We propose a method to automatically identify label errors in image captions using the multimodal neighborhood of image-caption pairs.
Abstract: Large repositories of image-caption pairs are essential for the development of vision-language models. However, these datasets are often extracted from noisy data scraped from the web, and contain many mislabeled instances. In order to improve the reliability of downstream models, it is important to identify and filter images with incorrect captions. However, beyond filtering based on image-caption embedding similarity, no prior works have proposed other methods to filter noisy multimodal data, or concretely assessed the impact of noisy captioning data on downstream training. In this work, we propose, theoretically justify, and empirically validate LEMoN, a method to identify label errors in image-caption datasets. Our method leverages the multimodal neighborhood of image-caption pairs in the latent space of contrastively pretrained multimodal models to automatically identify label errors. Through empirical evaluations across eight datasets and twelve baselines, we find that LEMoN outperforms the baselines by over 3% in label error detection, and that training on datasets filtered using our method improves downstream captioning performance by more than 2 BLEU points over noisy training.
Lay Summary: Billions of image-caption pairs scraped from the web fuel today's vision-language AI. However, many of these pairs are mislabeled: the caption does not match the accompanying image. Models trained on these mislabeled instances may underperform or learn harmful associations, which is especially worrying in fields like medicine.
Our work introduces LEMoN, a method for automatically detecting these mismatched pairs. It looks not only at how well each image matches its own caption, but also at how both compare with their closest "neighbors" in a shared vision-language space. If the neighbors disagree, the pair is probably mislabeled. We test our method on eight diverse datasets, from everyday photos to chest X-rays, and find that LEMoN does better at detecting mislabeled examples than the best prior tools.
By helping researchers clean their data, or surface suspicious samples for expert review, LEMoN paves the way for more reliable and trustworthy AI systems.
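The core idea can be illustrated with a minimal sketch. This is not the paper's exact LEMoN scoring function, just a simplified neighborhood-based mismatch score over precomputed image and caption embeddings (e.g., from a contrastively pretrained model such as CLIP); the function name and the combination rule below are illustrative assumptions.

```python
import numpy as np

def mismatch_score(img_emb, cap_emb, k=5):
    """Simplified neighborhood-based mismatch score (illustrative only,
    not the paper's exact LEMoN formulation).

    A pair looks suspicious when its own image-caption similarity is low
    even though the captions of its nearest neighbor images match the
    image well.
    """
    # Normalize embeddings so dot products equal cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    cap = cap_emb / np.linalg.norm(cap_emb, axis=1, keepdims=True)

    n = len(img)
    # Direct similarity of each image to its own caption.
    direct = np.sum(img * cap, axis=1)

    scores = np.empty(n)
    for i in range(n):
        # k nearest images in embedding space, excluding the pair itself.
        sims = img @ img[i]
        nbrs = np.argsort(-sims)[1 : k + 1]
        # How well do the neighbors' captions describe this image?
        nbr_agree = np.mean(cap[nbrs] @ img[i])
        # High score: the neighborhood fits the image better than
        # the pair's own caption does -- a likely label error.
        scores[i] = nbr_agree - direct[i]
    return scores
```

On a toy dataset with two well-separated concepts, swapping one caption gives that pair the highest score, which is the behavior a filtering pipeline would threshold on.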
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/MLforHealth/LEMoN
Primary Area: Social Aspects->Robustness
Keywords: label error detection, noisy labels, image captions
Submission Number: 4968