BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks

Yunhan Zhao; Xiang Zheng; Lin Luo; Yige Li; Xingjun Ma; Yu-Gang Jiang

BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks

Yunhan Zhao, Xiang Zheng, Lin Luo, Yige Li, Xingjun Ma, Yu-Gang Jiang

Published: 22 Jan 2025, Last Modified: 02 May 2025ICLR 2025 PosterEveryoneRevisionsBibTeXCC BY 4.0

Keywords: Jailbreak Defense, Blue-Teaming, Large Vision-Language Model

TL;DR: In this work, we focus on black-box defense against VLM jailbreaks and propose a novel defense framework dubbed BlueSuffix to achieve multimodal robustness by training a blue-team suffix generator using reinforcement learning (RL).

Abstract: In this paper, we focus on black-box defense for VLMs against jailbreak attacks. Existing black-box defense methods are either unimodal or bimodal. Unimodal methods enhance either the vision or language module of the VLM, while bimodal methods robustify the model through text-image representation realignment. However, these methods suffer from two limitations: 1) they fail to fully exploit the cross-modal information, or 2) they degrade the model performance on benign inputs. To address these limitations, we propose a novel blue-team method BlueSuffix that defends target VLMs against jailbreak attacks without compromising its performance under black-box setting. BlueSuffix includes three key components: 1) a visual purifier against jailbreak images, 2) a textual purifier against jailbreak texts, and 3) a blue-team suffix generator using reinforcement fine-tuning for enhancing cross-modal robustness. We empirically show on four VLMs (LLaVA, MiniGPT-4, InstructionBLIP, and Gemini) and four safety benchmarks (Harmful Instruction, AdvBench, MM-SafetyBench, and RedTeam-2K) that BlueSuffix outperforms the baseline defenses by a significant margin. Our BlueSuffix opens up a promising direction for defending VLMs against jailbreak attacks. Code is available at https://github.com/Vinsonzyh/BlueSuffix.

Primary Area: alignment, fairness, safety, privacy, and societal considerations

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 1748

Loading