Keywords: AI Safety, Multimodal Large Language Models, Reasoning Robustness, Benchmark
Abstract: As Multimodal Large Language Models (MLLMs) acquire stronger reasoning capabilities for handling complex, multi-image instructions, this advancement may introduce new safety risks. We study this problem by introducing MIR-SafetyBench, the first benchmark focused on multi-image reasoning safety, comprising 2,676 instances across a taxonomy of 9 multi-image relations. Our extensive evaluation of 19 MLLMs reveals a troubling trend: models with more advanced multi-image reasoning can be more vulnerable on MIR-SafetyBench. Beyond attack success rates, we find that many responses labeled as safe are only superficially so, often stemming from misunderstanding of the instruction or from evasive, non-committal replies. We further observe that unsafe generations exhibit lower attention entropy than safe ones on average. This internal signature points to a possible risk that models over-focus on task solving while neglecting safety constraints.
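(For intuition on the attention-entropy signal mentioned above, the following is a minimal sketch, not the paper's released code: it assumes a Hugging Face transformers causal LM as a stand-in and averages the Shannon entropy of attention distributions over layers, heads, and query positions. The model name, prompt, and averaging scheme are illustrative assumptions, not details taken from the paper.)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical stand-in model; the paper evaluates MLLMs, which expose
# analogous attention tensors via output_attentions=True.
MODEL_NAME = "gpt2"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def mean_attention_entropy(text: str) -> float:
    """Average Shannon entropy of the attention distributions over all
    layers, heads, and query positions for a single forward pass."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_attentions=True)
    per_layer = []
    for attn in out.attentions:                   # each tensor: (batch, heads, seq, seq)
        probs = attn.clamp_min(1e-12)             # guard against log(0)
        ent = -(probs * probs.log()).sum(dim=-1)  # entropy per query position
        per_layer.append(ent.mean())
    return torch.stack(per_layer).mean().item()

print(mean_attention_entropy("Describe how the two images relate to each other."))
```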
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: Resources and Evaluation, Language Modeling, Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: Model analysis & interpretability, Data resources, Data analysis
Languages Studied: English
Submission Number: 10027