PRISM: Robust VLM Alignment with Principled Reasoning for Integrated Safety in Multimodality

15 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Safety Alignment; Vision Language Model; Reasoning
TL;DR: This paper introduces PRISM, a framework that teaches VLMs a structured reasoning process for identifying harmful intent, highlighting the critical trade-off between safety and utility in VLM alignment.
Abstract: Safeguarding vision-language models (VLMs) is a critical challenge: existing methods often suffer from over-defense, which harms utility, or rely on shallow alignment that fails to detect complex threats requiring deep reasoning. To this end, we introduce **PRISM** (**P**rincipled **R**easoning for **I**ntegrated **S**afety in **M**ultimodality), a System-2-like framework that aligns VLMs by embedding a structured, safety-aware reasoning process. Our framework consists of two key components: PRISM-CoT, a dataset that teaches safety-aware chain-of-thought reasoning, and PRISM-DPO, a preference dataset generated via Monte Carlo Tree Search (MCTS) that further refines this reasoning through Direct Preference Optimization, helping the model learn a precise safety boundary. Comprehensive evaluations demonstrate PRISM's effectiveness: it achieves remarkably low attack success rates, including 0.15% on JailbreakV-28K for Qwen2-VL and a 90% improvement over the previous best method on VLBreak for LLaVA-1.5. PRISM also exhibits strong robustness against adaptive attacks, significantly increasing computational costs for adversaries, and generalizes effectively to out-of-distribution challenges, reducing the attack success rate to just 8.70% on the challenging multi-image MIS benchmark. Remarkably, this robust defense is achieved while preserving model utility.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 5937