Keywords: safety evaluation benchmarks, multimodal large language models, modality combinations
TL;DR: We present the Unified Safety Benchmark (USB) for MLLMs, which provides 4 distinct modality combinations for each of the 61 risk sub-categories, across both vulnerability and over-refusal dimensions.
Abstract: Despite their remarkable achievements and widespread adoption, Multimodal Large Language Models (MLLMs) have revealed significant vulnerabilities, highlighting the urgent need for robust safety evaluation benchmarks. However, existing MLLM safety benchmarks are limited in scope, scale, effectiveness, and coverage of multimodal risks, yielding inflated and contradictory results that hinder the effective discovery and management of vulnerabilities. To address these shortcomings, we introduce the Unified Safety Benchmark (USB), one of the most comprehensive evaluation benchmarks in MLLM safety. Our benchmark features extensive risk categories, comprehensive modality combinations, diverse and effective queries, and encompasses both vulnerability and over-refusal evaluations. Along two key dimensions, risk categories and modality combinations, we demonstrate that the available benchmarks (even the union of the vast majority of them) are far from truly comprehensive. To bridge this gap, we design a sophisticated data synthesis pipeline that generates extensive, effective complementary data addressing previously unexplored aspects. By combining open-source datasets with our synthetic data, our benchmark provides 4 distinct modality combinations for each of the 61 risk sub-categories. Furthermore, beyond evaluating vulnerability to harmful queries, we pioneer the simultaneous assessment of model over-refusal on benign inputs. Extensive experiments across 12 mainstream open-source MLLMs and 5 closed-source commercial MLLMs demonstrate that existing MLLMs still struggle with the trade-off between avoiding vulnerabilities and over-refusal, and are more vulnerable to image-only or cross-modal risky inputs, highlighting the need for refined safety mechanisms. Warning: This paper contains unfiltered and potentially harmful content that may be offensive.
Supplementary Material: pdf
Primary Area: datasets and benchmarks
Submission Number: 10209