Abstract: Large language models (LLMs) excel at many tasks, but their safety guarantees vary by language: responses in English tend to be safer than those in low-resource languages. This inconsistency creates a vulnerability, since an attacker can circumvent safety measures by using a less-supported language as an intermediary, even without fluency in that language. Traditional solutions rely on multilingual safety alignment, which demands vast per-language datasets and introduces significant trade-offs between usefulness and safety (the so-called ``alignment tax'').
To overcome these limitations, we introduce \emph{English as Defense Proxy (E-Proxy)}, a unified approach that leverages English, typically the language in which LLMs are strongest, as a universal safety anchor. During multilingual training, E-Proxy uses English jailbreak prompts to extract the model’s existing safety knowledge, then applies simple language-mapping prompts (e.g., “Please answer in \{target language\}”) to transfer that knowledge across languages. Our analysis shows that formulating prompts in a high-resource language preserves the model’s utility, while enforcing responses in the target language significantly enhances safety. We evaluate E-Proxy on extensive benchmarks of both attack resistance and task performance. On the MultiJail benchmark, E-Proxy blocks over 99\% of jailbreak attempts while retaining 95\% of average task performance, all with simply constructed multilingual alignment data.
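As an illustration of the prompt construction described above, the following is a minimal sketch, not the authors' implementation. It pairs one English prompt with a language-mapping instruction for each target language. The template wording follows the example quoted in the abstract; the names `MAPPING_TEMPLATE`, `ProxyPrompt`, and `build_proxy_prompts`, the placeholder prompt string, and the choice of target languages are all hypothetical. How the safe target-language responses are obtained is not specified here and is left out.

```python
from dataclasses import dataclass

# Template wording taken from the example in the abstract; the constant name is hypothetical.
MAPPING_TEMPLATE = "Please answer in {target_language}"


@dataclass(frozen=True)
class ProxyPrompt:
    """One English prompt plus a language-mapping instruction (illustrative type)."""
    target_language: str
    text: str


def build_proxy_prompts(english_prompt, target_languages):
    """Append a language-mapping instruction to an English prompt, once per language."""
    prompts = []
    for language in target_languages:
        instruction = MAPPING_TEMPLATE.format(target_language=language)
        prompts.append(ProxyPrompt(language, f"{english_prompt.strip()}\n{instruction}"))
    return prompts


# Placeholder prompt and languages, chosen only for illustration.
examples = build_proxy_prompts("<English jailbreak prompt>", ["Bengali", "Swahili"])
for example in examples:
    print(example.target_language, "->", example.text.splitlines()[-1])
```

The prompt itself stays in English in every record; only the requested response language changes, which mirrors the abstract's claim that high-resource prompts preserve utility while target-language responses carry the safety behavior.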
Paper Type: Long
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: Interpretability and Analysis of Models for NLP, Language Modeling, Multilingualism and Cross-Lingual NLP
Contribution Types: Model analysis & interpretability, Approaches to low-resource settings
Languages Studied: English
Submission Number: 2404