Keywords: Multilingual Jailbreak Attack, Suffix Attack, Adversarial Attack, AI Safety
TL;DR: We extend the GCG and genetic algorithm (AutoDAN) methods to generate multilingual jailbreak prompts
Abstract: Aligned Large Language Models (LLMs) are powerful decision-making tools that are created through extensive alignment with human feedback and are capable of multilingual language understanding. However, these large models remain susceptible to jailbreak attacks, where adversaries manipulate prompts to elicit harmful outputs that aligned LLMs should not produce. Automated multilingual jailbreak prompts could increase evasion of content moderation and create further challenges for multilingual alignment. Investigating multilingual jailbreak prompts helps expose the limitations of LLMs and guides efforts to secure them against multilingual attacks. Past research has focused on the generation of English jailbreak prompts, such as the GCG (Zou et al., 2023) and AutoDAN (Liu et al., 2024) methods. Existing research on multilingual jailbreaks employs either handcrafted multilingual jailbreak prompts or prompts directly translated from English. In this paper, we introduce two methods, namely Multilingual GCG and Multilingual AutoDAN, to automate the generation of multilingual jailbreak prompts. Moreover, this paper proposes a novel graph-based method to further automate the multilingual jailbreak attack against aligned LLMs and increase the attack success rate (ASR). In this graph-based method, adversaries traverse a graph whose nodes correspond to different languages, automatically generating and evaluating multilingual prompts. The resulting multilingual jailbreak prompts effectively elicit harmful outputs from popular open-source LLMs such as Mistral-v0.3, Llama-3.1, and Qwen-2.5. Interestingly, the success rate of multilingual jailbreak attacks is much higher than the baseline, and Multilingual GCG and Multilingual AutoDAN also achieve high ASRs with long multilingual jailbreak prompts. Overall, this work significantly advances research on adversarial attacks against aligned language models, raising important questions about how such systems can be prevented from producing harmful information in response to multilingual prompts.
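The abstract's graph-based method can be pictured as a search over a language graph. Below is a minimal, hypothetical sketch of that idea: the language graph, the `generate_prompt` and `evaluate_asr` helpers, and the stopping threshold are all illustrative assumptions and not the paper's actual implementation.

```python
# Hypothetical sketch of the graph-based multilingual jailbreak traversal
# described in the abstract. The language graph, scoring function, and
# prompt-generation step are illustrative placeholders, not the paper's code.
import random
from collections import deque

# Nodes are languages; edges connect languages the attack may move between.
LANGUAGE_GRAPH = {
    "en": ["zh", "fr", "ar"],
    "zh": ["en", "ja"],
    "fr": ["en", "ar"],
    "ar": ["en", "fr"],
    "ja": ["zh", "en"],
}

def generate_prompt(base_prompt: str, language: str) -> str:
    """Placeholder: produce a candidate jailbreak prompt in `language`
    (e.g. via Multilingual GCG or Multilingual AutoDAN)."""
    return f"[{language}] {base_prompt}"

def evaluate_asr(prompt: str) -> float:
    """Placeholder: query the target LLM and score attack success in [0, 1]."""
    return random.random()

def graph_jailbreak_search(base_prompt: str, start: str = "en",
                           threshold: float = 0.8) -> tuple[str, float]:
    """Breadth-first traversal over the language graph, generating and
    evaluating a multilingual prompt at each node and keeping the best."""
    visited, queue = {start}, deque([start])
    best_prompt, best_score = "", 0.0
    while queue:
        lang = queue.popleft()
        candidate = generate_prompt(base_prompt, lang)
        score = evaluate_asr(candidate)
        if score > best_score:
            best_prompt, best_score = candidate, score
        if best_score >= threshold:  # early stop once the attack succeeds
            break
        for neighbor in LANGUAGE_GRAPH.get(lang, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
    return best_prompt, best_score

if __name__ == "__main__":
    prompt, score = graph_jailbreak_search("example request")
    print(f"best prompt: {prompt!r}, ASR estimate: {score:.2f}")
```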
Submission Number: 51