PANDORA: Detailed LLM Jailbreaking via Collaborated Phishing Agents with Decomposed Reasoning

Zhaorun Chen; Zhuokai Zhao; Wenjie Qu; Zichen Wen; Zhiguang Han; Zhihong Zhu; Jiaheng Zhang; Huaxiu Yao

Published: 04 Mar 2024, Last Modified: 14 Apr 2024, Venue: SeT LLM @ ICLR 2024, License: CC BY 4.0

Keywords: LLM Jailbreaking, Trustworthy and Secure Large Language Models, Decomposed Reasoning, LLM Multi-step Reasoning, LLM agents

TL;DR: We propose PANDORA, a novel method for jailbreaking LLMs through collaborated phishing agents with decomposed reasoning.

Abstract: While the breakthrough of large language models (LLMs) has significantly advanced natural language processing, it has also introduced new vulnerabilities, especially in security and privacy. Jailbreak attacks, a core component of red-teaming LLMs, are an effective way to understand and improve LLM security by testing the resilience of existing safety features and simulating real-world attacks. In this paper, we propose **PANDORA**, a novel approach to LLM jailbreaking through collaborated phishing agents with decomposed reasoning. PANDORA leverages the multi-step reasoning capabilities of LLMs, decomposing adversarial attacks into stealthier sub-queries to elicit more informative responses. More specifically, it consists of four collaborated sub-modules, each tailored to refine the attack strategy dynamically while producing the adversarial response. In addition, we propose two new metrics, **PASS** and **Adv-NER**, which complement current jailbreaking evaluations with response-quality measures that work without ground truth. Extensive experiments on the AdvBench-subset demonstrate PANDORA's superior performance over existing state-of-the-art methods on four major victim models. Notably, even a more efficient, distilled version of the original PANDORA achieves high success rates on LLMs with black-box access, such as GPT-4 and GPT-3.5, while requiring much less memory and far fewer query iterations than other jailbreak approaches.

Submission Number: 112
