Character-Level Perturbations Amplify LLM Jailbreak Attacks

ICLR 2026 Conference Submission 24863 Authors

20 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: large language models, tokenization vulnerability, character-level perturbations, jailbreak attacks, safety mechanisms
Abstract: Contemporary large language models (LLMs) exhibit remarkable capabilities, yet their subword tokenization is vulnerable: small character-level perturbations can re-partition text into unfamiliar subwords, degrading model performance across tasks. Building on this observation, we show that the same tokenization vulnerability also compromises safety mechanisms in jailbreak scenarios. We introduce a simple, model- and template-agnostic character-level jailbreak method and demonstrate that minimal character-level perturbations increase the success rates of both simple and complex jailbreak attacks across multiple LLMs. We show that these perturbations lead to over-fragmented tokenization and token representation drift, causing the semantic representations of perturbed words to diverge substantially from those of the originals. Furthermore, analyses based on word-level semantic recovery and sentence-level spelling error detection and correction show that models struggle to reconstruct the original semantics of perturbed content. Layer-wise probe classifiers likewise fail to reliably detect the harmful intent of perturbed jailbreak prompts, further exposing the models' difficulty in comprehending adversarially perturbed input. Finally, we find that in certain cases perturbations reduce rather than increase attack success, because the corrupted spans fit less naturally into the template. Together, our findings demonstrate that tokenization-induced vulnerabilities compromise safety mechanisms, underscoring the need to investigate mitigation strategies.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 24863
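
Illustrative note: the over-fragmentation described in the abstract can be sketched with a toy example. The snippet below is a minimal sketch, not the submission's method or any real model's tokenizer. The vocabulary `TOY_VOCAB`, the function `greedy_tokenize`, and the greedy longest-match rule are all hypothetical choices made for illustration, applied to a benign word. It shows only how a single adjacent-character swap can turn a two-token segmentation into a run of single-character pieces.

```python
def greedy_tokenize(word, vocab):
    """Greedy longest-match segmentation (an assumed, simplified scheme).

    Characters not covered by any vocabulary entry are emitted one at a time.
    """
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest vocabulary entry that starts at position i.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # No entry matched: fall back to a single-character token.
            tokens.append(word[i])
            i += 1
    return tokens


# Hypothetical toy vocabulary; real subword vocabularies are far larger.
TOY_VOCAB = {"token", "ization", "iz", "ation", "to", "ken"}

clean = greedy_tokenize("tokenization", TOY_VOCAB)
# Adjacent swap of 'z' and 'a' as a stand-in for a character-level perturbation.
perturbed = greedy_tokenize("tokeniaztion", TOY_VOCAB)

print(len(clean), clean)
print(len(perturbed), perturbed)
```

In this toy setting the clean word maps to two familiar subwords, while the perturbed word keeps only its intact prefix and falls back to single characters for the rest. Real tokenizers differ in vocabulary and merge rules, so the exact fragmentation will vary.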