Keywords: Jailbreak Robustness, low-resource languages, Indic languages, less-resourced languages, resources for less-resourced languages
TL;DR: Indic Jailbreak Robustness (IJR): refusal contracts overestimate safety; the Free track shows near-universal jailbreaks and strong English-to-Indic transfer.
Abstract: Safety alignment of large language models (LLMs) is often evaluated in English and under rigid refusal contracts, leaving vulnerabilities in multilingual and script-diverse contexts underexplored. We introduce $\textbf{Indic Jailbreak Robustness (IJR)}$, the first judge-free benchmark for adversarial safety across 12 Indic and South Asian languages (~2.09 billion speakers). IJR covers 42,636 prompts across two tracks: $JSON$ (contract-bound) and $Free$ (naturalistic).
Our findings reveal three consistent patterns. First, contracts inflate conservatism without preventing jailbreaks: in $JSON$, LLaMA and Sarvam exceed $0.92$ JSR despite high refusal rates, while in $Free$ all models reach $\approx 1.0$ JSR and refusals collapse. Second, English$\to$Indic transfer is seamless: both instruction and format wrappers succeed, with \emph{format} often the stronger of the two, showing that attacks crafted in a high-resource language compromise low-resource languages. Third, orthography shifts matter: romanized and mixed inputs typically \emph{reduce} JSR under $JSON$, and correlations with romanization share and tokenization features ($\rho\approx0.28$–$0.32$) indicate systematic effects rather than noise. Human audits (E5) confirm detector reliability, and lite-to-full comparisons (E7) show that the conclusions hold under reduced evaluation. Taken together, IJR establishes a reproducible, multi-language stress test that uncovers vulnerabilities invisible to English-only, contract-only benchmarks, and highlights unique risks for South Asian users, for whom code-switching, romanization, and cross-lingual prompts are pervasive.
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 21528