Seeing is Not Always Believing: Undermining Trust in Safety Alignment via Imperceptible Jailbreaks

ACL ARR 2026 January Submission3420 Authors

04 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Large Language Models, Imperceptible Jailbreaks
Abstract: To support responsible deployment, open-source LLMs are typically equipped with safety alignment, and their refusal of malicious questions serves as evidence of that alignment. However, if attackers can construct jailbreak prompts that are visually indistinguishable from the original malicious questions, consistently bypass these safeguards, and can be easily copied and shared alongside normal text, the validity of such alignment becomes questionable. In such scenarios, an open-source LLM that fails to reject an apparently unmodified malicious question may be perceived as misaligned, undermining both the provider's alignment claims and public trust in model providers. To reveal this risk, we introduce **imperceptible jailbreaks** that exploit a class of Unicode characters called *variation selectors*. By appending invisible variation selectors to malicious questions, the resulting jailbreak prompts appear visually identical to the original questions on screen, *while their tokenization is "secretly" altered*. We propose a chain-of-search pipeline to generate such adversarial suffixes and induce harmful responses. Our experiments show that these imperceptible jailbreaks achieve high attack success rates against four aligned LLMs, all without introducing any visible modification to the written prompt.
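
To illustrate the underlying mechanism described in the abstract, the sketch below shows how Unicode variation selectors can be appended to a string without changing how it renders, while changing its code-point sequence (and hence its tokenization). This is a minimal, hypothetical illustration using a benign placeholder prompt and arbitrarily chosen selector indices; it is not the authors' chain-of-search pipeline, and the optional tokenizer check assumes a Hugging Face tokenizer is available.

```python
# Minimal sketch (not the paper's pipeline): appending invisible Unicode
# variation selectors leaves a prompt visually unchanged on most renderers
# while altering its underlying code points, and thus its tokenization.

def append_variation_selectors(text: str, indices: list[int]) -> str:
    """Append variation selectors from the supplementary range
    U+E0100..U+E01EF (VS17..VS256); indices must lie in 0..239."""
    suffix = "".join(chr(0xE0100 + i) for i in indices)
    return text + suffix

original = "What is the capital of France?"        # benign placeholder prompt
modified = append_variation_selectors(original, [3, 41, 7])  # arbitrary choice

print(original)                       # renders identically ...
print(modified)                       # ... in most fonts and terminals
print(original == modified)           # False: the strings differ
print(len(original), len(modified))   # code-point counts differ

# With a Hugging Face tokenizer installed, the token sequences also differ:
# from transformers import AutoTokenizer
# tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
# print(tok.encode(original))
# print(tok.encode(modified))
```

In the paper's setting, the choice of which selectors to append is not arbitrary as above but is searched for adversarially so that the invisible suffix induces a harmful response.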
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: Jailbreak Attacks, Large Language Models
Languages Studied: English
Submission Number: 3420