Defending Jailbreak Attack in VLMs via Cross-modality Information Detector

Defending Jailbreak Attack in VLMs via Cross-modality Information Detector

ACL ARR 2024 June Submission4296 Authors

16 Jun 2024 (modified: 13 Aug 2024)ACL ARR 2024 June SubmissionEveryoneRevisionsBibTeXCC BY 4.0

Abstract: Vision Language Models (VLMs) extend the capacity of LLMs to comprehensively understand vision information, achieving remarkable performance in many vision-centric tasks. Despite that, recent studies have shown that these models are susceptible to jailbreak attacks, which refer to an exploitative technique where malicious users can break the safety alignment of the target model and generate misleading and harmful answers. This potential threat is caused by both the inherent vulnerabilities of LLM and the larger attack scope introduced by vision input. To enhance the security of VLMs against jailbreak attacks, researchers have developed various defense techniques. However, these methods either require modifications to the model's internal structure or demand significant computational resources during the inference phase. Multimodal information is a double-edged sword. While it increases the risk of attacks, it also provides additional data that can enhance safeguards. Inspired by this, we propose $\underline{\textbf{C}}$ross-modality $\underline{\textbf{I}}$nformation $\underline{\textbf{DE}}$tecto$\underline{\textbf{R}}$ ($\textit{CIDER})$, a plug-and-play jailbreaking detector designed to identify maliciously perturbed image inputs, utilizing the cross-modal similarity between harmful queries and adversarial images. This simple yet effective cross-modality information detector, $\textit{CIDER}$, is independent of the target VLMs and requires less computation cost. Extensive experimental results demonstrate the effectiveness and efficiency of $\textit{CIDER}$, as well as its transferability to both white-box and black-box VLMs.

Paper Type: Long

Research Area: Language Modeling

Research Area Keywords: security and privacy, red teaming, applications, robustness

Contribution Types: NLP engineering experiment

Languages Studied: English

Submission Number: 4296

Loading