Keywords: Code-mixing, LLM, Jailbreaking
Abstract: Large language models (LLMs) remain demonstrably unsafe despite sophisticated safety alignment techniques and multilingual red-teaming. Recent red-teaming work, however, has focused on incremental gains in attack success rather than on identifying underlying architectural vulnerabilities in models. In this work, we present \textbf{CMP-RT}, a novel red-teaming probe that combines code-mixing with phonetic perturbations (CMP), exposing a tokenizer-level safety vulnerability in transformers. By combining realistic elements of digital communication such as code-mixing and textese, CMP-RT preserves phonetics while perturbing safety-critical tokens, allowing harmful prompts to bypass alignment mechanisms while remaining highly interpretable, thereby exposing a gap between pre-training and safety alignment. Our results demonstrate robustness against standard defenses, attack scalability, and generalization of the vulnerability across modalities and to SOTA models such as Gemini-3-Pro, establishing CMP-RT as a serious threat model and highlighting tokenization as an under-examined vulnerability in current safety pipelines.
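The core idea of the abstract, perturbing only the safety-critical tokens of a prompt with textese-style substitutions while leaving the rest readable, can be sketched as below. This is a minimal illustration, not the paper's actual implementation; the substitution table, function names, and the choice of which tokens count as safety-critical are all hypothetical.

```python
# Illustrative sketch of the CMP idea: apply character-level,
# textese-style substitutions to safety-critical tokens so that the
# perturbed prompt stays human-interpretable but tokenizes differently.
# The substitution table below is an assumption for demonstration only.

TEXTESE = {
    "a": "@",
    "e": "3",
    "i": "1",
    "o": "0",
    "s": "5",
}


def textese_perturb(word: str) -> str:
    """Rewrite a single word using the textese substitution table."""
    return "".join(TEXTESE.get(ch, ch) for ch in word.lower())


def perturb_prompt(prompt: str, critical: set) -> str:
    """Perturb only the tokens flagged as safety-critical."""
    return " ".join(
        textese_perturb(tok) if tok.lower() in critical else tok
        for tok in prompt.split()
    )


if __name__ == "__main__":
    # Benign example: only the flagged token is rewritten.
    print(perturb_prompt("describe a weapon", {"weapon"}))
```

In practice the paper additionally code-mixes the prompt across languages (here English and Hindi); this sketch shows only the phonetic-perturbation half of the probe.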
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: red teaming, jailbreaking, multilingual, multimodal, LLM safety
Contribution Types: Model analysis & interpretability, Publicly available software and/or pre-trained models
Languages Studied: English, Hindi
Submission Number: 9665