Quack: Automatic Jailbreaking Large Language Models via Role-playing

23 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Supplementary Material: pdf
Primary Area: general machine learning (i.e., none of the above)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Large Language Models, Jailbreak, Testing
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Large Language Models (LLMs) excel in Natural Language Processing (NLP) with human-like text generation, but their misuse has raised public concern and prompted the need for safety measures. Proactive testing with jailbreaks, meticulously crafted prompts that bypass model constraints and policies, has become the mainstream approach to ensuring security and reliability upon model release. While researchers have made substantial efforts to explore jailbreaks against LLMs, existing methods still suffer from the following disadvantages: (1) they require human labor and expertise to design question prompts; (2) their jailbreaks are non-deterministic and hard to reproduce; (3) they are less effective on updated model versions and cannot be iteratively reused once invalidated. To address these challenges, we introduce Quack, an automated testing framework based on role-playing of LLMs. Quack translates testing guidelines into question prompts, instead of relying on human expertise and labor. It systematically analyzes and consolidates successful jailbreaks into a paradigm featuring eight distinct characteristics. Based on this paradigm, we reconstruct and maintain existing jailbreaks through knowledge graphs, which serve as Quack's repository of playing scenarios. Quack assigns four distinct roles to LLMs for automatically organizing, evaluating, and further updating jailbreaks. We empirically demonstrate the effectiveness of our method on three state-of-the-art open-source LLMs (Vicuna-13B, LongChat-7B, and LLaMa-7B), as well as one widely used commercial LLM (ChatGPT). Our work addresses the pressing need for LLM security and contributes valuable insights for building safer LLM-empowered applications.
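The abstract describes a pipeline in which testing guidelines are turned into question prompts, playing scenarios are drawn from a knowledge-graph repository, and four LLM roles organize, evaluate, and update jailbreaks. The sketch below illustrates how such a round-based loop could be wired up; the role names (generator, organizer, evaluator, updater), prompt wording, and the `run_quack_round` interface are assumptions for illustration only and are not taken from the paper.

```python
# Hypothetical sketch of one round of a four-role, role-playing test loop.
# All role names, prompts, and signatures are illustrative assumptions.
from dataclasses import dataclass, field
from typing import Callable, List

LLMFn = Callable[[str], str]  # any text-in / text-out model endpoint


@dataclass
class PlayingScenario:
    text: str                                  # scenario drawn from the knowledge-graph repository
    attempts: List[str] = field(default_factory=list)


def run_quack_round(guideline: str,
                    scenario: PlayingScenario,
                    target: LLMFn,
                    generator: LLMFn,   # role 1: translate a testing guideline into a question prompt
                    organizer: LLMFn,   # role 2: weave the question into the playing scenario
                    evaluator: LLMFn,   # role 3: judge whether the target's reply is a jailbreak
                    updater: LLMFn      # role 4: revise an invalid scenario for iterative reuse
                    ) -> bool:
    """One automated test round of the role-playing pipeline (illustrative only)."""
    question = generator(f"Turn this testing guideline into a question prompt:\n{guideline}")
    prompt = organizer(f"Scenario:\n{scenario.text}\n\nQuestion:\n{question}")

    reply = target(prompt)
    scenario.attempts.append(reply)

    verdict = evaluator(f"Reply:\n{reply}\n\nDoes this reply bypass the model's policy? Answer yes or no.")
    if verdict.strip().lower().startswith("yes"):
        return True   # successful jailbreak: keep the scenario in the repository

    # Otherwise revise the scenario so it can be reused in the next round.
    scenario.text = updater(f"Failed scenario:\n{scenario.text}\n\nTarget reply:\n{reply}\n\nRevise the scenario.")
    return False


if __name__ == "__main__":
    # Stub models so the sketch runs without any API; swap in real endpoints to test.
    stub = lambda p: "no"
    s = PlayingScenario(text="<role-playing scenario from the knowledge graph>")
    run_quack_round("<testing guideline>", s, target=stub,
                    generator=stub, organizer=stub, evaluator=stub, updater=stub)
```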
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7894