Hiding in Plain Sight: A Steganographic Approach to Stealthy LLM Jailbreaks

15 Sept 2025 (modified: 11 Dec 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Large Language Models; Jailbreak Attacks; Information Hiding
TL;DR: We introduce StegoAttack, a novel and fully stealthy jailbreak method that bypasses LLM safety defenses by hiding malicious instructions within benign, natural text, making the attacks both successful and difficult to detect.
Abstract: Jailbreak attacks pose a serious threat to Large Language Models (LLMs) by bypassing their safety mechanisms. A truly advanced jailbreak is defined not only by its effectiveness but, more critically, by its stealthiness. However, existing methods face a fundamental trade-off between *semantic stealth* (hiding malicious intent) and *linguistic stealth* (appearing natural), leaving them vulnerable to detection. To resolve this trade-off, we propose *StegoAttack*, a framework that leverages steganography. The core insight is to embed a harmful query within a benign, semantically coherent paragraph. This design provides semantic stealth by concealing the existence of malicious content, and ensures linguistic stealth by maintaining the natural fluency of the cover paragraph. We evaluate StegoAttack on four state-of-the-art, safety-aligned LLMs, including OpenAI-o3 and DeepSeek-R1, and benchmark it against eight leading jailbreak methods. Our results show that StegoAttack achieves an average attack success rate (ASR) of 92.00%, outperforming the strongest baseline by 11.00%. Critically, its ASR drops by less than 1.00% under external detection, demonstrating an unprecedented combination of high efficacy and exceptional stealth. The code is available at https://anonymous.4open.science/r/StegoAttack-Jail66
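To make the steganographic idea concrete, the minimal sketch below hides a query word by word as the leading word of each sentence in a cover paragraph and recovers it by reading those leading words back. This is only an illustrative toy encoding, not the paper's actual embedding scheme; the function names, the cover sentences, and the benign placeholder secret are all hypothetical.

```python
# Toy sentence-level "acrostic" steganography: the hidden message is carried by the
# first word of each sentence in an otherwise benign cover paragraph. Illustrative
# only; the paper's real method produces fluent, coherent cover text.

def embed(secret: str, cover_sentences: list[str]) -> str:
    """Hide each word of `secret` as the first word of successive cover sentences."""
    words = secret.split()
    if len(words) > len(cover_sentences):
        raise ValueError("cover text needs at least one sentence per hidden word")
    stego = []
    for i, sentence in enumerate(cover_sentences):
        if i < len(words):
            # Prepend the next secret word; a real scheme would rewrite the sentence
            # so the added word reads naturally instead of being bolted on.
            stego.append(f"{words[i].capitalize()}, {sentence}")
        else:
            stego.append(sentence)
    return " ".join(stego)


def extract(stego_text: str, n_words: int) -> str:
    """Recover the hidden message from the first word of each sentence."""
    sentences = [s.strip() for s in stego_text.split(".") if s.strip()]
    firsts = [s.split(",")[0].strip().lower() for s in sentences[:n_words]]
    return " ".join(firsts)


if __name__ == "__main__":
    secret = "meet at noon tomorrow"  # benign placeholder payload
    cover = [
        "the weather was mild that afternoon.",
        "everyone gathered near the old library.",
        "someone suggested a walk along the river.",
        "a few of us stayed behind to read.",
        "nobody noticed the time passing.",
    ]
    stego = embed(secret, cover)
    print(stego)
    print(extract(stego, len(secret.split())))  # -> "meet at noon tomorrow"
```

Because the hidden payload never appears as contiguous text, a surface-level content filter sees only the benign cover paragraph; this is the intuition behind the semantic and linguistic stealth the abstract describes, though the actual attack relies on far more natural embeddings than this toy.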
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 5441