SaFeR-VLM: Toward Safety-aware Fine-grained Reasoning in Multimodal Models

14 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Multimodal Safety, Multimodal Learning, Reinforcement Learning
TL;DR: We propose SaFeR-VLM, a safety-aligned reinforcement learning framework that integrates safety into the reasoning process of multimodal large models, achieving SOTA safety and helpfulness without sacrificing performance.
Abstract: Multimodal Large Reasoning Models (MLRMs) demonstrate impressive cross-modal reasoning but often amplify safety risks under adversarial or unsafe prompts, a phenomenon we call the *Reasoning Tax*. Existing defenses act mainly at the output level and do not constrain the reasoning process, leaving models exposed to implicit risks. In this paper, we propose **SaFeR-VLM**, a safety-aligned reinforcement learning framework that embeds safety directly into multimodal reasoning. The framework integrates four components: (I) QI-Safe-10K, a curated dataset emphasizing safety-critical and reasoning-sensitive cases; (II) safety-aware rollout, where unsafe generations undergo reflection and correction instead of being discarded; (III) structured reward modeling with multi-dimensional weighted criteria and explicit penalties for hallucinations and contradictions; and (IV) GRPO optimization, which reinforces both safe and corrected trajectories. This unified design shifts safety from a passive safeguard to an active driver of reasoning, enabling scalable and generalizable safety-aware reasoning. SaFeR-VLM further demonstrates robustness against both explicit and implicit risks, supporting dynamic and interpretable safety decisions beyond surface-level filtering. SaFeR-VLM-3B achieves average scores of **70.13** (safety) and **78.97** (helpfulness) across six benchmarks, surpassing both same-scale and >10× larger models such as Skywork-R1V3-38B, Qwen2.5VL-72B, and GLM4.5V-106B. Remarkably, SaFeR-VLM-7B benefits from its increased scale to surpass GPT-5-mini and Gemini-2.5-Flash by **6.47** and **16.76** points respectively on safety metrics, without any degradation in helpfulness. Our code is available at [https://anonymous.4open.science/r/ICLR2026-5065](https://anonymous.4open.science/r/ICLR2026-5065).
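The reward design and optimization sketched in components (III) and (IV) can be illustrated with a minimal example. This is a hypothetical sketch, not the paper's implementation: the criterion names, weights, and penalty value are illustrative assumptions, and only the generic form is shown, namely a weighted multi-criteria reward with explicit penalties, followed by the group-relative advantage normalization used in GRPO-style training.

```python
# Hypothetical sketch of structured reward modeling + GRPO-style advantages.
# Criterion names, weights, and penalty magnitudes are illustrative only.

def structured_reward(scores, weights, hallucinated=False, contradictory=False,
                      penalty=1.0):
    """Weighted sum over per-criterion scores, minus explicit penalties
    for hallucinations and contradictions."""
    r = sum(weights[k] * scores[k] for k in weights)
    if hallucinated:
        r -= penalty
    if contradictory:
        r -= penalty
    return r

def grpo_advantages(rewards, eps=1e-8):
    """GRPO-style group-relative advantages: normalize each rollout's
    reward by the mean and std of its rollout group."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Example: two rollouts scored on (assumed) safety/helpfulness criteria.
weights = {"safety": 0.6, "helpfulness": 0.4}
r_safe = structured_reward({"safety": 1.0, "helpfulness": 0.5}, weights)
r_unsafe = structured_reward({"safety": 1.0, "helpfulness": 0.5}, weights,
                             hallucinated=True)  # penalized trajectory
advs = grpo_advantages([r_safe, r_unsafe])
```

Because advantages are computed relative to the rollout group, a corrected trajectory that ends up safe can still receive a positive learning signal even when its raw reward was reduced along the way, which is consistent with the abstract's point that unsafe generations are reflected upon and corrected rather than discarded.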
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 5064