RED QUEEN: SAFEGUARDING LARGE LANGUAGE MODELS AGAINST CONCEALED MULTI-TURN ATTACK

27 Sept 2024 (modified: 09 Dec 2024) · ICLR 2025 Conference Withdrawn Submission · CC BY 4.0
Keywords: Jailbreaking, Large Language Models, Safety Alignment
TL;DR: RED QUEEN ATTACK is a new jailbreak attack and the first to construct multi-turn scenarios that conceal the attacker’s harmful intent, achieving high attack success rates against current LLMs.
Abstract: The rapid progress of large language models (LLMs) has opened up new opportunities across various domains and applications, yet it also presents challenges related to potential misuse. To mitigate such risks, red teaming, a strategy in which developers adopt the role of potential attackers, has been employed to probe language models and preemptively guard against such harms. Jailbreak attacks, a commonly used red-teaming strategy, use crafted prompts to bypass safety guardrails. However, current jailbreak approaches are single-turn and rely on explicit malicious queries, which do not fully capture the complexity of real-world interactions. In reality, users can engage in multi-turn interactions with LLM-based chat assistants, allowing them to conceal their true intentions more effectively. Research on the Theory of Mind (ToM) reveals that LLMs struggle to infer latent intent, making it crucial to investigate how LLMs handle concealed malicious intent within multi-turn scenarios. To bridge this gap, we propose a new jailbreak approach, RED QUEEN ATTACK. This method constructs multi-turn scenarios that conceal the malicious intent under the guise of preventing harm. We craft 40 scenarios that vary in the number of turns and select 14 harmful categories to generate 56k multi-turn attack data points. We conduct comprehensive experiments on the RED QUEEN ATTACK with four representative LLM families of different sizes. Our experiments reveal that all LLMs are vulnerable to the RED QUEEN ATTACK, which reaches an 87.6% attack success rate on GPT-4o and 77.1% on Llama3-70B. Further analysis reveals that larger models are more susceptible to the RED QUEEN ATTACK, with multi-turn structures and concealment strategies contributing to its success. To prioritize safety, we introduce a straightforward mitigation strategy called RED QUEEN GUARD, which aligns LLMs to effectively counter adversarial attacks. This approach reduces the attack success rate to below 1% while maintaining the model’s performance across standard benchmarks. We release our code and data to support future research.
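The sketch below illustrates one way the 56k multi-turn attack data points described in the abstract could be assembled, assuming each data point crosses a multi-turn scenario template with a harmful action drawn from one of the 14 categories. The function names, field names, and the per-category action count are illustrative assumptions, not the paper’s released code or templates.

```python
# Minimal sketch, not the authors' code: assumes each data point pairs a
# multi-turn scenario template (user turns with an {action} slot) with a
# harmful action, formatted as an OpenAI-style chat message list.
from itertools import product

def fill_scenario(scenario_turns, action):
    """Instantiate a scenario template by filling its {action} placeholder."""
    return [{"role": "user", "content": turn.format(action=action)}
            for turn in scenario_turns]

def build_dataset(scenarios, actions_by_category):
    """Cross every scenario with every harmful action to form the attack set."""
    dataset = []
    for scenario, (category, actions) in product(scenarios,
                                                 actions_by_category.items()):
        for action in actions:
            dataset.append({
                "category": category,
                "num_turns": len(scenario),
                "messages": fill_scenario(scenario, action),
            })
    return dataset

# e.g., 40 scenarios x 14 categories x 100 actions per category = 56,000
# data points (the exact per-category split is an assumption here).
```

Under this assumed construction, the number of turns per data point is fixed by the chosen scenario template, which is consistent with the abstract’s statement that the 40 scenarios vary in the number of turns.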
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8646