Defend against Jailbreak Attacks via Debate with Partially Perceptive Agents

28 Sept 2024 (modified: 14 Nov 2024) · ICLR 2025 Conference Withdrawn Submission · CC BY 4.0
Keywords: Multi-agent Debate; Defense; Vision Large Language Models
Abstract: Recent studies have shown that maliciously injecting into or perturbing the input image of Vision Large Language Models (VLMs) can lead to jailbreak attacks, raising significant security concerns. A straightforward defense against such attacks is to crop the input image, thereby disrupting the injection or perturbation. However, cropping can significantly distort the semantics of the input image, adversely affecting the model's output on clean inputs. To mitigate this adverse impact, we propose a defense mechanism against jailbreak attacks based on multi-agent debate. In this method, one agent (the "integrated" agent) accesses the full image, while the other (the "partial" agent) accesses only cropped partial images, aiming to avoid the attack while preserving the correct semantics of the output as much as possible. Our key insight is that when the integrated agent debates with the partial agent, if the integrated agent receives clean input, it can successfully persuade the partial agent; conversely, if the integrated agent is given an attacked input, the partial agent can persuade it to rethink its original output, thereby achieving an effective defense. Empirical experiments demonstrate that our method defends more effectively than the baseline, reducing the average attack success rate from 100% to 22%. In more advanced experimental setups, our method limits the average attack success rate to 18% (debating with GPT-4o) and 14% (with an enhanced perspective).
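
To make the debate mechanism concrete, here is a minimal Python sketch of the loop the abstract describes: an integrated agent answers from the full image, a partial agent answers from crops, and each reconsiders in light of the other's answer. Everything in this sketch is an illustrative assumption, not the submission's implementation: the grid cropping scheme, the `query_vlm` placeholder (standing in for a real VLM call), the naive string-equality agreement check, and the round budget are all hypothetical.

```python
# Sketch of the debate defense described in the abstract.
# All names here (query_vlm, crop_image, debate_defense) are assumptions
# for illustration, not the authors' code.

from PIL import Image


def crop_image(image, grid=2):
    """Split the image into a grid of non-overlapping crops (assumed scheme)."""
    w, h = image.size
    return [
        image.crop((i * w // grid, j * h // grid,
                    (i + 1) * w // grid, (j + 1) * h // grid))
        for i in range(grid)
        for j in range(grid)
    ]


def query_vlm(images, prompt):
    """Placeholder for a VLM call; a real system would query a model here."""
    return f"[VLM answer to {prompt!r} given {len(images)} image(s)]"


def debate_defense(image, question, rounds=3):
    """Integrated agent sees the full image; partial agent sees only crops.
    They exchange answers until they agree or the round budget is spent."""
    crops = crop_image(image)
    integrated = query_vlm([image], question)
    partial = query_vlm(crops, question)
    for _ in range(rounds):
        if integrated == partial:  # naive agreement check (assumption)
            break
        # Each agent reconsiders in light of the other's answer, so a clean
        # integrated view can persuade the partial agent, while crops that
        # break an injected attack can push the integrated agent to rethink.
        integrated = query_vlm(
            [image],
            f"{question}\nAnother agent, seeing only crops, answered: "
            f"{partial}\nReconsider and answer again.")
        partial = query_vlm(
            crops,
            f"{question}\nAnother agent, seeing the full image, answered: "
            f"{integrated}\nReconsider and answer again.")
    return integrated  # final (defended) answer


if __name__ == "__main__":
    img = Image.new("RGB", (224, 224))  # dummy input for the sketch
    print(debate_defense(img, "Describe the image."))
```

In a real system the agreement check and the final-answer selection would likely be handled by a judge model or a semantic-similarity test rather than string equality; this stub only shows the structure of the exchange.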
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 14205