GUARD: Role-playing to Generate Natural-language Jailbreakings to Test Guideline Adherence of Large Language Models

Published: 04 Mar 2024, Last Modified: 14 Apr 2024 · SeT LLM @ ICLR 2024 · CC BY 4.0
Keywords: large language models, safety, jailbreaking, red-teaming
Abstract: Large Language Models (LLMs) face significant challenges with "jailbreaks" — specially crafted prompts designed to bypass safety filters and circumvent safety measures. In response, researchers have focused on developing comprehensive testing protocols that can efficiently generate a wide array of potential jailbreaks. In this paper, we propose a role-playing system, namely GUARD (Guideline Upholding through Adaptive Role-play Diagnostics), which automatically follows government-issued guidelines to generate jailbreaks that test whether LLMs adhere to those guidelines. GUARD works by assigning four different roles to LLMs that collaborate to generate jailbreaks, mimicking the way humans create them. We have empirically validated the effectiveness of GUARD on three cutting-edge open-source LLMs (Vicuna-13B, LongChat-7B, and Llama-2-7B), as well as a widely used commercial LLM (ChatGPT). Moreover, our work extends to the realm of vision-language models (MiniGPT-v2 and Gemini Vision Pro), showcasing GUARD's versatility and contributing valuable insights for the development of safer, more reliable LLM-based applications across diverse modalities.
Submission Number: 8
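
The abstract above describes GUARD's four-role, role-playing collaboration only at a high level and gives no implementation detail. The sketch below is one illustrative reading of that description, not the authors' code: the role names (`generator`, `evaluator`, `refiner`, `judge`), the `Chat` callable interface, and the stopping rule are all assumptions introduced here for clarity.

```python
from typing import Callable, Dict, List

# A "chat role" is any function mapping a prompt string to a reply string,
# e.g. a thin wrapper around an LLM API or a local model. This interface is
# an assumption for illustration, not GUARD's actual API.
Chat = Callable[[str], str]

def role_play_round(guideline: str,
                    roles: Dict[str, Chat],
                    target: Chat,
                    max_rounds: int = 3) -> List[dict]:
    """Draft, critique, refine, and test a guideline-probing prompt against a target LLM."""
    transcript: List[dict] = []
    # Hypothetical role names; the paper's actual four roles may differ.
    draft = roles["generator"](f"Draft a test prompt probing this guideline: {guideline}")
    for _ in range(max_rounds):
        critique = roles["evaluator"](
            f"Does this prompt meaningfully test the guideline?\nPrompt: {draft}")
        draft = roles["refiner"](
            f"Revise the prompt using the critique.\nPrompt: {draft}\nCritique: {critique}")
        reply = target(draft)  # query the LLM under test
        verdict = roles["judge"](
            f"Guideline: {guideline}\nModel reply: {reply}\n"
            "Does the reply adhere to the guideline? Answer ADHERES or VIOLATES.")
        transcript.append({"prompt": draft, "reply": reply, "verdict": verdict})
        if "VIOLATES" in verdict.upper():
            break  # a violation was found; stop and record it for the report
    return transcript
```

In practice each role would presumably wrap its own prompt template and model instance, and the resulting transcripts would feed a report on which guidelines the target model failed to uphold.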