Track: long paper (up to 8 pages)
Keywords: vision-language models, robustness
Abstract: Vision-language models (VLMs) achieve strong performance on standard, high-quality datasets, but their behavior under real-world image distortions remains poorly understood.
We present **VLM-RobustBench**, a benchmark spanning 49 augmentation types across noise, blur, weather, digital, and geometric perturbations, evaluated under graded severities (low/mid/high) and binary transforms, yielding 133 corrupted settings.
We evaluate VLMs from four families (Qwen, InternVL, Molmo, Gemma) on two complementary benchmarks: MMBench (visually grounded) and MMMU-Pro (reasoning-oriented).
Our results reveal that visual severity is a weak predictor of difficulty: low-severity spatial perturbations often degrade performance more than visually severe photometric corruptions.
In particular, low-severity *glass_blur* reduces MMBench accuracy by about 8pp on average across models, while the largest drops arise from resampling and geometric distortions (e.g., *upsample*, *elastic_transform*), reaching up to 34pp.
Anonymization: This submission has been anonymized for double-blind review by removing identifying information such as names, affiliations, and URLs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 74