JointMMSafe: A Combinatorial Safety Benchmark for Multimodal Foundation Models

Published: 24 Sept 2025 (Last Modified: 24 Sept 2025) · NeurIPS 2025 LLM Evaluation Workshop Poster · License: CC BY 4.0
Keywords: Alignment, Multimodal Safety, LLM Evaluation, Benchmark
TL;DR: We expose critical gaps in multimodal AI safety by showing that models handle obviously unsafe content well but fail dramatically when safety judgments require joint vision–language reasoning.
Abstract: Safety evaluation of multimodal foundation models often treats vision and language inputs separately, missing risks that arise from joint interpretation, where individually benign content becomes harmful in combination. Existing approaches also fail to distinguish clearly unsafe content from borderline cases, leading to problematic over-blocking or under-refusal. We present a comprehensive framework that introduces a borderline severity level alongside the safe and unsafe levels, enabling fine-grained evaluation across joint image–text safety combinations. Using a multi-step, context-driven synthetic pipeline conditioned on real-world images, we construct JointMMSafe, a large-scale, human-graded benchmark for evaluation across structured multimodal severity combinations. Evaluations reveal systematic failures of joint understanding: while models excel when clear safety signals exist in individual modalities (90\%+ accuracy), performance degrades consistently when joint multimodal understanding is required, i.e., in scenarios where safety emerges only through combined image–text interpretation. Furthermore, borderline content exposes significant alignment instability: refusal rates vary dramatically from 62.4\% to 10.4\% for identical content based solely on instruction framing, and this instability leads to concerning under-refusal of unsafe content (a refusal rate of only 53.9\%). Our framework exposes weaknesses in joint image–text understanding and alignment gaps in current models, highlighting the need for research on robust vision–language safety.
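The structured severity grid described above can be pictured as the cross product of per-modality labels. The following minimal Python sketch is illustrative only (the label names, record fields, and `refusal_rates` helper are assumptions, not the paper's released code); it enumerates the image–text severity combinations and tallies per-cell refusal rates in the way such a benchmark might be scored.

```python
from itertools import product
from collections import defaultdict

# Assumed severity taxonomy from the abstract: safe / borderline / unsafe per modality.
SEVERITIES = ["safe", "borderline", "unsafe"]

# Hypothetical benchmark records: each pairs an image severity, a text severity,
# and whether the evaluated model refused the combined image-text prompt.
records = [
    {"image": "safe", "text": "unsafe", "refused": True},
    {"image": "borderline", "text": "borderline", "refused": False},
    {"image": "unsafe", "text": "safe", "refused": True},
]

def refusal_rates(records):
    """Compute refusal rates for every image x text severity combination."""
    counts = defaultdict(lambda: [0, 0])  # cell -> [refusal count, total count]
    for r in records:
        cell = (r["image"], r["text"])
        counts[cell][0] += int(r["refused"])
        counts[cell][1] += 1
    # Report all 9 cells of the grid, even those with no examples yet (None).
    return {
        cell: (counts[cell][0] / counts[cell][1] if counts[cell][1] else None)
        for cell in product(SEVERITIES, SEVERITIES)
    }

for (img, txt), rate in refusal_rates(records).items():
    print(f"image={img:10s} text={txt:10s} refusal_rate={rate}")
```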
Submission Number: 154