Keywords: Large Reasoning Models, Large Language Models, Red-teaming
Abstract: Open-weight Large Reasoning Models (LRMs) are rapidly catching up with their closed-source counterparts. Their capabilities and widespread adoption also pose realistic high-risk scenarios, since open-weight models are difficult to patch and their usage cannot be monitored. To prevent misuse, reasoning-based safety guardrails, such as Deliberative Alignment, are applied to defend against jailbreaks. These guardrails first analyze the safety aspects of a prompt and refuse to assist once they detect harmful intent. They have demonstrated strong defenses, such as the near-perfect refusal rates of OpenAI's gpt-oss series. Unfortunately, we find that these guardrails can be extremely vulnerable, and relying on them can even yield unsafe models that answer plainly forbidden questions such as `How to kill a man without being caught?'. Specifically, we identify a systematic vulnerability in the reasoning-then-answer mechanism: simply mimicking the structure of the reasoning stage can directly subvert the guardrails. Based on this finding, we design 4 red-teaming methods that achieve alarmingly high attack success rates (e.g., exceeding 90% across 5 benchmarks on the gpt-oss series, using both local models and online APIs). Our work reveals that these guardrails, although promising, can make models unsafe. The vulnerabilities are not bound to the pre-defined stage structure, and once the guardrails are hijacked, tailored and more harmful responses can be obtained. Evaluations of various leading open-weight LRMs confirm that these vulnerabilities are systemic, underscoring the urgent need for stronger alignment of open-weight LRMs.
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: safety and alignment, red teaming
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 1625