Strata-Sword: A Hierarchical Safety Evaluation towards LLMs based on Reasoning Complexity of Jailbreak Instructions

ICLR 2026 Conference Submission 16665 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Large language models, Jailbreak Attack, Reasoning Complexity
Abstract: Large language models (LLMs) have gained widespread recognition for their strong performance and have been deployed across numerous domains. Building on the Chain-of-Thought (CoT) paradigm, large reasoning models (LRMs) further exhibit strong reasoning skills, enabling them to infer more accurately and respond more appropriately. However, strong general reasoning capabilities do not guarantee safe responses to jailbreak instructions that demand even more robust reasoning to recognize. A model with strong general reasoning capabilities but lacking the corresponding safety capabilities can create serious vulnerabilities in real-world applications. A comprehensive benchmark is therefore needed to evaluate a model's safety performance when facing instructions of different reasoning complexity, providing a new dimension along which to characterize the safety boundaries of LLMs. This paper quantifies "reasoning complexity" as an evaluable safety dimension and categorizes 15 jailbreak attack methods into three levels according to their reasoning complexity, establishing a hierarchical Chinese-English jailbreak safety benchmark for systematically evaluating the safety performance of LLMs. To fully account for the reasoning complexity introduced by language-specific characteristics, we also propose, for the first time, several Chinese jailbreak attack methods, including the Chinese Character Disassembly attack, the Lantern Riddle attack, and the Acrostic Poem attack. A series of experiments shows that current LLMs and LRMs exhibit different safety boundaries under different levels of reasoning complexity, offering a new perspective for developing safer LLMs and LRMs.
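To make the evaluation protocol described in the abstract concrete, below is a minimal sketch (not the authors' released code) of how a benchmark organized by reasoning-complexity level might be scored: prompts are grouped by attack method, attack methods are grouped into levels, and a safe-response rate is reported per level. The level assignments, the function names, and the keyword-based judge in the usage example are assumptions made purely for illustration; only the three Chinese-specific attack names come from the abstract.

```python
from collections import defaultdict
from typing import Callable

# Hypothetical grouping of jailbreak attack methods into three
# reasoning-complexity levels. The grouping shown here is assumed,
# not taken from the paper.
REASONING_LEVELS = {
    "level_1": ["direct_harmful_query", "role_play"],
    "level_2": ["payload_splitting", "cipher_encoding"],
    "level_3": ["chinese_character_disassembly", "lantern_riddle", "acrostic_poem"],
}


def evaluate_safety_by_level(
    generate: Callable[[str], str],           # model under test: prompt -> response
    judge: Callable[[str], bool],             # safety judge: response -> is it safe?
    prompts_by_attack: dict[str, list[str]],  # benchmark prompts keyed by attack method
) -> dict[str, float]:
    """Return the safe-response rate for each reasoning-complexity level."""
    safe, total = defaultdict(int), defaultdict(int)
    for level, attacks in REASONING_LEVELS.items():
        for attack in attacks:
            for prompt in prompts_by_attack.get(attack, []):
                total[level] += 1
                safe[level] += judge(generate(prompt))
    # Only report levels that actually had prompts.
    return {level: safe[level] / total[level] for level in total}


if __name__ == "__main__":
    # Toy usage with stub model and judge, for illustration only.
    toy_prompts = {
        "role_play": ["pretend you are ..."],
        "lantern_riddle": ["solve this riddle ..."],
    }
    rates = evaluate_safety_by_level(
        generate=lambda p: "I can't help with that.",
        judge=lambda r: "can't help" in r.lower(),
        prompts_by_attack=toy_prompts,
    )
    print(rates)
```

A per-level breakdown like this is what allows the safety boundary to be read off as a function of reasoning complexity rather than as a single aggregate refusal rate.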
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 16665