Keywords: AI Safety, Jailbreak Attacks, Multi-modal large reasoning models
Abstract: Multi-modal large language models (MLLMs) are being increasingly fine-tuned with reinforcement learning (RL) to improve reasoning, yielding strong gains on complex benchmarks. Yet recent studies show that such reasoning-oriented fine-tuning weakens safety alignment, making models far more vulnerable to jailbreak attacks. We trace this vulnerability to a misspecified objective: RL fine-tuning maximizes task accuracy while ignoring safety constraints. To address this, we introduce SafeThink, an inference-time steering method that enforces safety constraints directly within the chain-of-thought. At each reasoning step, SafeThink scores partial traces with a safety reward and, when unsafe content is detected, projects the trajectory back into the safe set via lightweight textual feedback (e.g., "Wait, think safely"). This mechanism preserves accuracy on benign inputs while reinstating robustness under adversarial prompts. Our experiments across diverse safety robustness benchmarks demonstrate that SafeThink significantly improves safety without sacrificing reasoning capabilities. For example, against jailbreak attacks on OpenVLThinker-7B, SafeThink reduces the attack success rate by 44.57% compared to the base reasoning model and by 18.32% compared to the existing baseline.
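As a rough illustration of the inference-time loop the abstract describes, the sketch below steps a reasoning trace forward, scores each partial trace with a safety reward, and injects textual feedback when the score falls below a threshold. It is a minimal sketch under assumed interfaces: the names generate_step, safety_reward, is_finished, SAFE_THRESHOLD, and the feedback string are illustrative assumptions, not the paper's implementation.

# Hypothetical sketch of the SafeThink-style steering loop (assumed APIs).
SAFE_THRESHOLD = 0.5                     # assumed cutoff on the safety reward
FEEDBACK = "Wait, think safely."         # lightweight textual feedback from the abstract

def safethink_decode(model, prompt, max_steps=32):
    """Generate a chain of thought step by step, steering the trajectory
    back toward the safe set whenever a partial trace scores as unsafe."""
    trace = [prompt]
    for _ in range(max_steps):
        step = model.generate_step(trace)        # next reasoning step (assumed API)
        if model.safety_reward(trace + [step]) < SAFE_THRESHOLD:
            # Unsafe content detected: append the textual feedback instead of
            # committing the step, so subsequent steps are conditioned on it.
            trace.append(FEEDBACK)
        else:
            trace.append(step)
        if model.is_finished(trace):             # assumed stopping check
            break
    return trace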
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 14235