Three Minds, One Legend: Jailbreak Large Reasoning Model with Adaptive Stacked Ciphers

ACL ARR 2025 May Submission 7391 Authors

20 May 2025 (modified: 03 Jul 2025), ACL ARR 2025 May Submission, CC BY 4.0
Abstract: Recently, Large Reasoning Models (LRMs) have demonstrated superior logical capabilities compared to traditional Large Language Models (LLMs), gaining significant attention. Despite their impressive performance, the potential for stronger reasoning abilities to introduce more severe security vulnerabilities remains largely underexplored. Existing jailbreak methods often struggle to balance effectiveness with robustness against adaptive safety mechanisms. In this work, we propose SEAL, a novel jailbreak attack that targets LRMs through an adaptive encryption pipeline designed to override their reasoning processes and evade potential adaptive alignment. Specifically, SEAL introduces a stacked encryption approach that combines multiple ciphers to overwhelm the model's reasoning capabilities, effectively bypassing built-in safety mechanisms. To further prevent LRMs from developing countermeasures, we incorporate two dynamic strategies, random and adaptive, that adjust the cipher length, order, and combination. Extensive experiments on real-world reasoning models, including DeepSeek-R1, Claude Sonnet, and OpenAI GPT-o4, validate the effectiveness of our approach. Notably, SEAL achieves an attack success rate of 80.8% on GPT o4-mini, outperforming state-of-the-art baselines by a significant margin of 27.2%. Warning: This paper contains examples of inappropriate, offensive, and harmful content.
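The abstract describes the stacked-cipher construction only at a high level. The sketch below is a minimal illustration of the general idea of stacking several classical encodings in a randomized order and depth (corresponding to the "random" strategy); the specific ciphers (Caesar, ROT13, Base64), the depth range, and all function names are assumptions made for illustration and are not the authors' actual SEAL pipeline.

```python
import base64
import codecs
import random

# NOTE: illustrative sketch only; the cipher set and stacking policy are
# assumptions, not the SEAL implementation described in the paper.

def caesar(text: str, shift: int = 3) -> str:
    """Shift alphabetic characters by a fixed offset, leaving others unchanged."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord("a") if ch.islower() else ord("A")
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)

def rot13(text: str) -> str:
    """Apply the ROT13 substitution cipher."""
    return codecs.encode(text, "rot13")

def b64(text: str) -> str:
    """Encode the text as Base64."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")

CIPHERS = {"caesar": caesar, "rot13": rot13, "base64": b64}

def stack_encrypt(prompt: str, depth: int = 0, seed: int = None):
    """Apply a randomly ordered stack of ciphers to a prompt.

    Returns the encoded prompt and the cipher order, which the attacker
    would also need in order to tell the model how to decode the request.
    """
    rng = random.Random(seed)
    depth = depth or rng.randint(2, len(CIPHERS))
    order = rng.sample(list(CIPHERS), k=depth)
    text = prompt
    for name in order:
        text = CIPHERS[name](text)
    return text, order

if __name__ == "__main__":
    encoded, order = stack_encrypt("example query", seed=0)
    print(order, encoded)
```

An adaptive variant would re-sample the depth and order based on the target model's responses rather than purely at random; that feedback loop is not shown here.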
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: chain-of-thought, safety and alignment, red teaming, robustness, transfer, adversarial attacks
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 7391