Keywords: Multilingual Jailbreak Attack, Suffix Attack, Adversarial Attack, AI Safety
TL;DR: We extend the GCG and genetic algorithm (AutoDAN) methods to generate multilingual jailbreak prompts
Abstract: Aligned Large Language Models (LLMs) are powerful decision-making tools that are created through extensive alignment with human feedback and are capable of multilingual language understanding. However, these large models remain susceptible to jailbreak attacks, where adversaries manipulate prompts to elicit harmful outputs that aligned LLMs should not produce. Automated multilingual jailbreak prompts could increase evasion of content moderation and create further challenges for multilingual alignment. Investigating multilingual jailbreak prompts helps expose the limitations of LLMs and guides efforts to secure them against multilingual attacks. Past research has focused on the generation of English jailbreak prompts, such as the GCG (Zou et al., 2023) and AutoDAN (Liu et al., 2024) methods. Existing research on multilingual jailbreaks employs either handcrafted multilingual jailbreak prompts or prompts directly translated from English. In this paper, we introduce two methods, namely Multilingual GCG and Multilingual AutoDAN, to automate the generation of multilingual jailbreak prompts. Moreover, this paper proposes a novel graph-based method to further automate the multilingual jailbreak attack against aligned LLMs and increase the attack success rate (ASR). In this graph-based method, adversaries traverse a graph whose nodes correspond to different languages, automatically generating and evaluating multilingual prompts. The resulting multilingual jailbreak prompts effectively elicit harmful outputs from popular open-source LLMs such as Mistral-v0.3, Llama-3.1, and Qwen-2.5. Interestingly, the success rate of multilingual jailbreak attacks is much higher than the baseline, and Multilingual GCG and Multilingual AutoDAN also achieve high ASRs with long multilingual jailbreak prompts. Overall, this work significantly advances research on adversarial attacks against aligned language models, raising important questions about how such systems can be prevented from producing harmful information in response to multilingual prompts.
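The abstract's graph-based method can be pictured as a search over a language graph. Below is a minimal, hypothetical sketch of that idea: the language graph, the `generate_prompt` and `evaluate_asr` helpers, and the stopping threshold are all illustrative assumptions and not the paper's actual implementation.

```python
# Hypothetical sketch of the graph-based multilingual jailbreak traversal
# described in the abstract. The language graph, scoring function, and
# prompt-generation step are illustrative placeholders, not the paper's code.
import random
from collections import deque

# Nodes are languages; edges connect languages the attack may move between.
LANGUAGE_GRAPH = {
    "en": ["zh", "fr", "ar"],
    "zh": ["en", "ja"],
    "fr": ["en", "ar"],
    "ar": ["en", "fr"],
    "ja": ["zh", "en"],
}

def generate_prompt(base_prompt: str, language: str) -> str:
    """Placeholder: produce a candidate jailbreak prompt in `language`
    (e.g. via Multilingual GCG or Multilingual AutoDAN)."""
    return f"[{language}] {base_prompt}"

def evaluate_asr(prompt: str) -> float:
    """Placeholder: query the target LLM and score attack success in [0, 1]."""
    return random.random()

def graph_jailbreak_search(base_prompt: str, start: str = "en",
                           threshold: float = 0.8) -> tuple[str, float]:
    """Breadth-first traversal over the language graph, generating and
    evaluating a multilingual prompt at each node and keeping the best."""
    visited, queue = {start}, deque([start])
    best_prompt, best_score = "", 0.0
    while queue:
        lang = queue.popleft()
        candidate = generate_prompt(base_prompt, lang)
        score = evaluate_asr(candidate)
        if score > best_score:
            best_prompt, best_score = candidate, score
        if best_score >= threshold:  # early stop once the attack succeeds
            break
        for neighbor in LANGUAGE_GRAPH.get(lang, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
    return best_prompt, best_score

if __name__ == "__main__":
    prompt, score = graph_jailbreak_search("example request")
    print(f"best prompt: {prompt!r}, ASR estimate: {score:.2f}")
```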
Submission Number: 51