Refusal Degrades with Token-Form Drift: Limits of Token-Level Alignment

ICLR 2026 Conference Submission 21055 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Token-form drift, Safety alignment, Adversarial input perturbations
Abstract: Safety alignment of large language models (LLMs) is typically learned through supervised fine-tuning and preference optimization on a fixed distribution of token sequences. We show that this process couples refusal behavior to token form, making alignment fragile under token-form drift—semantics-preserving shifts in orthography, delimiters, substitutions, or segmentation. In controlled perturbation studies, we observe a universal rise–plateau–collapse pattern: refusals degrade as distributional divergence increases, harmful compliance peaks, and extreme shifts collapse into incoherence rather than recovered safety. To scale beyond handcrafted substitutions, we develop an LLM-in-the-loop perturbation framework that automatically discovers diverse, readable adversarial forms. Cross-form evaluation reveals a capability–vulnerability tradeoff: larger models resist low-level shifts longer, yet admit more effective perturbations over broader ranges, exposing wider attack surfaces. A patch-then-break study further shows that fine-tuning against one perturbation form does not transfer, as new effective forms re-emerge rapidly. These results demonstrate that current alignment remains token-level and form-sensitive, motivating future defenses that target semantics directly through form-invariant training, normalization, and cross-form robustness evaluation.
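The abstract names concrete families of token-form drift: shifts in orthography, delimiters, substitutions, and segmentation. The following minimal Python sketch (not the authors' code; function names and the substitution map are illustrative assumptions) shows what such semantics-preserving perturbations of a prompt might look like in practice.

# Minimal sketch, assuming simple string-level perturbations of the kinds
# named in the abstract: delimiter insertion, character substitution, and
# re-segmentation. All names here are hypothetical, not the paper's framework.
import random

LEET_MAP = {"a": "4", "e": "3", "i": "1", "o": "0", "s": "5"}  # example look-alike substitutions

def insert_delimiters(text: str, delim: str = "-") -> str:
    """Insert a delimiter between the characters of each word (e.g. 'lock' -> 'l-o-c-k')."""
    return " ".join(delim.join(word) for word in text.split())

def substitute_chars(text: str, p: float = 0.3, seed: int = 0) -> str:
    """Randomly replace characters with look-alike substitutions while keeping readability."""
    rng = random.Random(seed)
    return "".join(
        LEET_MAP[c.lower()] if c.lower() in LEET_MAP and rng.random() < p else c
        for c in text
    )

def resegment(text: str, chunk: int = 3) -> str:
    """Re-segment the string into fixed-size chunks, shifting token boundaries."""
    flat = text.replace(" ", "")
    return " ".join(flat[i:i + chunk] for i in range(0, len(flat), chunk))

if __name__ == "__main__":
    prompt = "how do I pick a lock"
    for perturb in (insert_delimiters, substitute_chars, resegment):
        print(perturb.__name__, "->", perturb(prompt))

Each transform leaves the request's meaning recoverable to a human reader while changing the token sequence the model sees, which is the property the paper's perturbation studies vary to measure refusal degradation.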
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 21055