Keywords: MLRM, safety, alignment, safety-helpfulness trade-off
Abstract: Multimodal Large Reasoning Models (MLRMs) have exhibited remarkable capabilities in complex multimodal tasks.
However, our findings reveal a critical trade-off: reasoning-based models are more prone to generating harmful content than their non-reasoning counterparts, leading to degraded safety performance.
This paper presents a large-scale analysis of this safety–reasoning trade-off, identifying two main mechanisms of safety degradation: (i) visual attention drift, which reduces the model’s reliance on visual grounding and thereby allows risks arising from cross-modal interactions to go overlooked; and (ii) unsafe reasoning patterns, including flawed reasoning initiation and chain-of-thought safety attenuation, which compromise the model’s safety awareness.
To mitigate these issues, we propose **P**olicy-guided **S**afety **T**uning (**PST**), a two-stage alignment framework. It first employs *Policy-Guided Supervised Fine-Tuning* to integrate explicit safety policies into the reasoning process, establishing a structured and interpretable foundation for safe decision-making.
Then, PST applies *Safety Reasoning Preference Optimization* to encourage the model to generate safe, helpful, and informative responses while reducing oversensitivity and response homogeneity.
Extensive experiments demonstrate that PST significantly reduces harmful outputs across multiple multimodal safety benchmarks while maintaining competitive performance on general tasks.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 1280