LAION-C: An Out-of-Distribution Benchmark for Web-Scale Vision Models

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We introduce a new OOD robustness benchmark for web-scale models along with human performance results.
Abstract: Out-of-distribution (OOD) robustness is a desired property of computer vision models. Improving model robustness requires high-quality signals from robustness benchmarks to quantify progress. While various benchmark datasets such as ImageNet-C were proposed in the ImageNet era, most ImageNet-C corruption types are no longer OOD relative to today's large, web-scraped datasets, which already contain common corruptions such as blur or JPEG compression artifacts. Consequently, these benchmarks are no longer well-suited for evaluating OOD robustness in the era of web-scale datasets. Indeed, recent models show saturating scores on ImageNet-era OOD benchmarks, making it unclear whether models trained on web-scale datasets have truly become better at OOD generalization or have simply been exposed to the test distortions during training. To address this, we introduce LAION-C as a benchmark alternative to ImageNet-C. LAION-C consists of six novel distortion types specifically designed to be OOD, even for web-scale datasets such as LAION. In a comprehensive evaluation of state-of-the-art models, we find that LAION-C poses significant challenges to contemporary models, including MLLMs such as Gemini and GPT-4o. We additionally conduct a psychophysical experiment to evaluate the difficulty of our corruptions for human observers, enabling a comparison of models to lab-quality human robustness data. We observe a paradigm shift in OOD generalization: from humans outperforming models, to the best models now matching or outperforming the best human observers.
Lay Summary: To make computer vision models more reliable in the real world, we want them to handle new situations that they haven’t seen during their training phase. In the past, researchers used special test sets to check how well models handle such situations. For example, models trained to recognize objects in normal images might fail on blurry ones, so we use test sets like ImageNet-C (for “corruptions”) to evaluate this. But modern models are trained on huge collections of internet images, which already include many of those cases — making the old tests too easy. So, we built a new test set called LAION-C, with fresh, carefully designed images that today’s models haven’t already seen. When we tested top-performing AI models — including powerful systems like GPT-4o — we found that LAION-C was much harder for them. We also tested humans on the dataset and found that today’s best models are now approaching or even beating human-level performance on these difficult tasks. LAION-C offers a new and effective way to test how robust AI vision systems truly are.
Link To Code: https://github.com/FanfeiLi/LAION-C
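To make the benchmark's evaluation setup concrete, below is a minimal sketch of how per-corruption results might be aggregated into a single OOD accuracy score: average accuracy over severity levels within each corruption type, then over corruption types. This is an illustrative assumption, not the repository's actual protocol (see the linked code for that), and the corruption names and numbers used here are placeholders, not LAION-C results.

```python
from statistics import mean

def aggregate_ood_accuracy(per_condition_acc):
    """Aggregate {corruption: {severity: top-1 accuracy}} into
    per-corruption means and one overall mean OOD accuracy."""
    per_corruption = {
        corruption: mean(by_severity.values())
        for corruption, by_severity in per_condition_acc.items()
    }
    return per_corruption, mean(per_corruption.values())

# Hypothetical results: corruption type -> {severity level: top-1 accuracy}
results = {
    "corruption_a": {1: 0.90, 3: 0.70, 5: 0.40},
    "corruption_b": {1: 0.85, 3: 0.55, 5: 0.25},
}
per_corruption, overall = aggregate_ood_accuracy(results)
```

Averaging within each corruption first keeps the overall score balanced even if some corruption types ship with more severity levels than others.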
Primary Area: Deep Learning->Robustness
Keywords: OOD, representation learning, benchmark, model evaluation, vision, classification, psychophysics
Flagged For Ethics Review: true
Submission Number: 11430