SafeEraser: Enhancing Safety in Multimodal Large Language Models through Multimodal Machine Unlearning

SafeEraser: Enhancing Safety in Multimodal Large Language Models through Multimodal Machine Unlearning

ACL ARR 2025 February Submission1749 Authors

14 Feb 2025 (modified: 09 May 2025)ACL ARR 2025 February SubmissionEveryoneRevisionsBibTeXCC BY 4.0

Abstract: As Multimodal Large Language Models (MLLMs) develop, their potential security issues have become increasingly prominent. **M**achine **U**nlearning (MU), as an effective strategy for forgetting specific knowledge in training data, has been widely used in privacy protection. However, _MU for safety in MLLM has yet to be fully explored_. To address this issue, we propose \dataset, a safety unlearning benchmark for MLLMs, consisting of 3,000 images and 28.8K VQA pairs. We comprehensively evaluate unlearning methods from two perspectives: **_forget quality_** and **_model utility_**. Our findings show that existing MU methods struggle to maintain model performance while implementing the forget operation and often suffer from **_over-forgetting_**. Hence, we introduce **P**rompt **D**ecouple (PD) Loss to alleviate over-forgetting through decouple prompt during unlearning process. To quantitatively measure over-forgetting mitigated by PD Loss, we propose a new metric called **S**afe **A**nswer **R**efusal **R**ate (SARR). Experimental results demonstrate that combining PD Loss with existing unlearning methods can effectively prevent over-forgetting and achieve a decrease of 79.5% in the SARR metric of LLaVA-7B and LLaVA-13B, while maintaining forget quality and model utility. Our code and dataset will be released upon acceptance. **Warning: This paper contains examples of harmful language and images, and reader discretion is recommended.**

Paper Type: Long

Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond

Research Area Keywords: Multimodal Large Language Model, Machine Unlearning, Safety

Contribution Types: Model analysis & interpretability, Data resources

Languages Studied: English

Submission Number: 1749

Loading