Abstract: Recent studies reveal that integrating new modalities into large language models (LLMs), such as vision-language models (VLMs), creates a new attack surface that bypasses existing safety training techniques like supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF). While further SFT and RLHF-based safety training can be conducted in multi-modal settings, collecting multi-modal training datasets poses a significant challenge. Inspired by the structural design of recent multi-modal models, where
all input modalities are ultimately fused into the language space,
we explore whether unlearning solely in the textual domain can be effective for cross-modality safety alignment.
Our empirical evaluation across seven datasets demonstrates promising transferability --- textual unlearning in VLMs significantly reduces the Attack Success Rate (ASR) to less than 8% and in some cases, even as low as nearly 2% for both text-based and vision-text-based attacks, alongside preserving the utility. Moreover, our experiments show that unlearning with a multi-modal dataset offers no potential benefits but incurs significantly increased computational demands.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: multimodality
Languages Studied: English
Submission Number: 1804
Loading