Chain of Risk: Safety Failures in Large Reasoning Models and Mitigation via Adaptive Multi-Principle Steering
Keywords: Large Reasoning Models, Safety, Activation Steering
TL;DR: We show that large reasoning models can hide safety failures in their reasoning traces and propose adaptive multi-principle steering to detect and mitigate them.
Abstract: Large reasoning models (LRMs) increasingly expose chain-of-thought-like intermediate reasoning for transparency, verification, and deliberate problem solving. This creates a safety blind spot: harmful or policy-violating content may appear in the reasoning trace even when the final answer appears safe. We test whether final-answer safety is a sufficient proxy for the full reasoning-answer trajectory by scoring both stages under a unified twenty-principle safety rubric. Using prompts from seven public harmfulness and jailbreak sources, plus four out-of-distribution (OOD) sources for robustness evaluation, we evaluate 15 open-weight and API-based LRMs. Across 41K prompts per model, reasoning traces consistently expose additional safety risk beyond final answers. The effect is systematic and appears most clearly in high-severity stage-wise failures: **leak** cases, where unsafe reasoning precedes a safe-looking answer, and **escape** cases, where benign-looking reasoning precedes an unsafe final response. Principle-level analysis shows that risk concentrates in categories such as misinformation, legal compliance, discrimination, physical harm, and psychological harm. Beyond diagnosis, we propose **adaptive multi-principle steering**, a white-box test-time mitigation that learns one unsafe-to-safe activation direction per safety principle and activates only those directions for which the current hidden state is closer to the unsafe centroid than to the safe centroid. On three steerable open reasoning models, adaptive steering consistently reduces unsafe counts in both reasoning traces and final answers on held-out and OOD benchmarks. The strongest gains reduce unsafe reasoning by 77.2% on HeldOut2K and 62.7% on OOD2K, and reduce unsafe final responses by up to 48.1% on OOD2K. DeepSeek-R1-Qwen-7B achieves a 40.8% average unsafe-count reduction while retaining 97.7% of macro-averaged accuracy on BBH, GSM8K, and MMLU.
These results suggest that LRM safety should be evaluated and mitigated over the full exposed reasoning-answer trajectory, not only at the final-answer stage.
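The gating rule described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not say how directions are learned or how distance is measured, so the difference-of-means direction, the Euclidean distance, the step size `alpha`, and the names `fit_principle` and `adaptive_steer` are all assumptions made for illustration, and the two-dimensional activations are toy values.

```python
import numpy as np


def fit_principle(safe_acts, unsafe_acts):
    """Return (safe centroid, unsafe centroid, unit unsafe-to-safe direction).

    Assumption: the direction is the normalized difference of class means.
    """
    mu_safe = safe_acts.mean(axis=0)
    mu_unsafe = unsafe_acts.mean(axis=0)
    direction = mu_safe - mu_unsafe
    return mu_safe, mu_unsafe, direction / np.linalg.norm(direction)


def adaptive_steer(h, principles, alpha=1.0):
    """Steer h along each principle whose unsafe centroid is nearer than its safe one.

    Assumption: Euclidean distance for the gate and a fixed step size alpha.
    """
    steered = h.copy()
    active = []
    for name, (mu_safe, mu_unsafe, direction) in principles.items():
        if np.linalg.norm(h - mu_unsafe) < np.linalg.norm(h - mu_safe):
            steered += alpha * direction
            active.append(name)
    return steered, active


# Toy activations for two hypothetical principles.
principles = {
    "misinformation": fit_principle(
        np.array([[1.0, 0.0], [1.0, 0.0]]),    # safe examples
        np.array([[-1.0, 0.0], [-1.0, 0.0]]),  # unsafe examples
    ),
    "physical_harm": fit_principle(
        np.array([[0.0, 1.0]]),   # safe examples
        np.array([[0.0, -1.0]]),  # unsafe examples
    ),
}

h = np.array([-0.5, 0.5])  # hidden state near the "misinformation" unsafe centroid
steered, active = adaptive_steer(h, principles)
print(active, steered)
```

In this toy case only the `misinformation` direction fires, because `h` already sits on the safe side for `physical_harm`; leaving inactive principles untouched is what distinguishes the adaptive variant from applying every direction at once.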
Track: Regular Paper (9 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 260