Track: long paper (up to 8 pages)
Keywords: vision-language models, robustness
Abstract: Vision-language models (VLMs) achieve strong performance on standard, high-quality datasets, but their behavior under real-world image distortions remains poorly understood.
We present **VLM-RobustBench**, a benchmark spanning 49 augmentation types across noise, blur, weather, digital, and geometric perturbations, evaluated under graded severities (low/mid/high) and binary transforms, yielding 133 corrupted settings.
We evaluate VLMs from four families (Qwen, InternVL, Molmo, Gemma) on two complementary benchmarks: MMBench (visually grounded) and MMMU-Pro (reasoning-oriented).
Our results reveal that visual severity is a weak predictor of difficulty: low-severity spatial perturbations often degrade performance more than visually severe photometric corruptions.
In particular, low-severity *glass_blur* reduces MMBench accuracy by about 8pp on average across models, while the largest drops arise from resampling and geometric distortions (e.g., *upsample*, *elastic_transform*), reaching up to 34pp.
Anonymization: This submission has been anonymized for double-blind review by removing identifying information such as names, affiliations, and URLs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 74