Keywords: Large Reasoning Models, Large Language Models, Red-teaming
Abstract: Open-weight Large Reasoning Models (LRMs) are rapidly catching up with their closed-source counterparts. Their capabilities and widespread adoption also pose realistic high-risk scenarios, since open-weight models are difficult to patch and their usage cannot be monitored. To prevent misuse, reasoning-based safety guardrails, such as Deliberative Alignment, are applied to defend against jailbreaks. These guardrails first analyze the safety aspects of a prompt and refuse to assist once they detect harmful intent. They have demonstrated strong defenses, such as the near-perfect refusal rates of OpenAI's gpt-oss series. Unfortunately, we find that these guardrails can be extremely vulnerable, and relying on them can even yield unsafe models that answer plainly forbidden questions such as `How to kill a man without being caught?'. Specifically, we identify a systematic vulnerability in the reasoning-then-answer mechanism: simply mimicking the structure of the reasoning stage can directly subvert the guardrails. Based on this finding, we design 4 red-teaming methods that achieve alarmingly high attack success rates (e.g., exceeding 90% across 5 benchmarks on the gpt-oss series, using both local models and online APIs). Our work reveals that these guardrails, although promising, can make models unsafe. The vulnerabilities are not bound to the pre-defined stage structure, and once the guardrails are hijacked, tailored and more harmful responses can be obtained. Evaluations of various leading open-weight LRMs confirm that these vulnerabilities are systemic, underscoring the urgent need for stronger alignment of open-weight LRMs.
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: safety and alignment, red teaming
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 1625