Keywords: MLRM, safety, alignment, safety-helpfulness trade-off
Abstract: Multimodal Large Reasoning Models (MLRMs) have exhibited remarkable capabilities in complex multimodal tasks.
However, our findings reveal a critical trade-off: reasoning-based models are more prone to generating harmful content than their non-reasoning counterparts, leading to degraded safety performance.
This paper presents a large-scale analysis of this safety–reasoning trade-off, identifying two main mechanisms of safety degradation: (i) visual attention drift, which reduces the model’s reliance on visual grounding and thereby allows risks arising from cross-modal interactions to go overlooked; and (ii) unsafe reasoning patterns, including flawed reasoning initiation and chain-of-thought safety attenuation, which compromise the model’s safety awareness.
To mitigate these issues, we propose **P**olicy-guided **S**afety **T**uning (**PST**), a two-stage alignment framework. It first employs *Policy-Guided Supervised Fine-Tuning* to integrate explicit safety policies into the reasoning process, establishing a structured and interpretable foundation for safe decision-making.
Then, PST applies *Safety Reasoning Preference Optimization* to encourage the model to generate safe, helpful, and informative responses while reducing oversensitivity and response homogeneity.
Extensive experiments demonstrate that PST significantly reduces harmful outputs across multiple multimodal safety benchmarks while maintaining competitive performance on general tasks.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 1280