Abstract: Large Reasoning Models (LRMs) have recently demonstrated superior logical capabilities compared to traditional Large Language Models (LLMs) and have attracted significant attention. Despite their impressive performance, whether stronger reasoning abilities also introduce more severe security vulnerabilities remains largely underexplored. Existing jailbreak methods often struggle to balance effectiveness with robustness against adaptive safety mechanisms. In this work, we propose SEAL, a novel jailbreak attack that targets LRMs through an adaptive encryption pipeline designed to override their reasoning processes and evade potential adaptive alignment. Specifically, SEAL introduces a stacked encryption approach that combines multiple ciphers to overwhelm the model's reasoning capabilities, effectively bypassing built-in safety mechanisms. To further prevent LRMs from developing countermeasures, we incorporate two dynamic strategies, random and adaptive, that adjust the cipher length, order, and combination. Extensive experiments on real-world reasoning models, including DeepSeek-R1, Claude Sonnet, and OpenAI o4-mini, validate the effectiveness of our approach. Notably, SEAL achieves an attack success rate of 80.8% on o4-mini, outperforming state-of-the-art baselines by a significant margin of 27.2%. Warning: This paper contains examples of inappropriate, offensive, and harmful content.
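To make the stacked-encryption idea concrete, the sketch below shows a minimal prompt encoder that composes several simple ciphers in a randomly chosen length and order. The specific ciphers (Caesar shift, ROT13, Base64, reversal) and the `stack_encrypt` interface are illustrative assumptions for exposition, not the paper's actual SEAL implementation.

```python
import base64
import codecs
import random

# Illustrative sketch of a stacked-cipher encoder in the spirit of SEAL.
# The cipher set and pipeline structure are assumptions, not the paper's design.

def caesar(text: str, shift: int = 3) -> str:
    """Shift alphabetic characters by a fixed offset, leaving other characters unchanged."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord('A') if ch.isupper() else ord('a')
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return ''.join(out)

# Pool of simple, reversible transformations to stack on top of each other.
CIPHERS = {
    "caesar": caesar,
    "rot13": lambda s: codecs.encode(s, "rot13"),
    "base64": lambda s: base64.b64encode(s.encode()).decode(),
    "reverse": lambda s: s[::-1],
}

def stack_encrypt(prompt: str, min_len: int = 2, max_len: int = 4) -> tuple[str, list[str]]:
    """Apply a randomly sized, randomly ordered stack of ciphers to a prompt.

    Returns the encoded prompt together with the cipher order, which an attacker
    would describe to the target model so it can decode the request step by step.
    """
    depth = random.randint(min_len, max_len)               # random cipher length
    order = random.choices(list(CIPHERS), k=depth)         # random order and combination
    encoded = prompt
    for name in order:
        encoded = CIPHERS[name](encoded)
    return encoded, order

if __name__ == "__main__":
    ciphertext, order = stack_encrypt("example query")
    print("cipher stack:", order)
    print("ciphertext:", ciphertext)
```

An adaptive variant, as described in the abstract, would replace the uniform random choice with a policy that re-weights cipher length, order, and combination based on the target model's previous responses.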
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: chain-of-thought, safety and alignment, red teaming, robustness, transfer, adversarial attacks
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 7391