Cross-Modal Attention Guided Unlearning in Vision-Language Models

Published: 27 Oct 2025, Last Modified: 27 Oct 2025 · NeurIPS Lock-LLM Workshop 2025 Poster · CC BY 4.0
Keywords: vision-language models, unlearning, privacy
TL;DR: We propose a resource-efficient cross-modal attention guided mechanism to facilitate unlearning in vision-language models under practical privacy considerations.
Abstract: The inference abilities of large-scale pretrained models are often attributed to the size of pre-training data collected across several domains. However, these models may memorize private and/or sensitive information during training and regurgitate it during inference. Recently, machine unlearning has been leveraged to address such leakage in LLMs. Vision-language models (VLMs) add a further layer of complexity to this process, as the visual context in the query may also contain sensitive information. To address this issue, we study unlearning for VLMs, specifically for the Visual Question Answering (VQA) task. We analyze the role of visual tokens in output generation using cross-modal attention and leverage it to formulate Cross-Modal Attention Guided Unlearning (CAGUL), a lightweight and efficient VLM unlearning framework. In contrast to computationally expensive model finetuning methods, CAGUL uses external modules to encode unlearning information in visual tokens of low importance for relevant queries. We find that the transformed visual tokens not only prevent leakage but also retain reference model behavior. Experimental results show that our method performs better than or on par with finetuning-based baselines without altering the pre-trained model parameters or incurring retraining costs, making it a practical and effective VLM unlearning solution.
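The abstract describes the core mechanism only at a high level: cross-modal attention is used to find visual tokens that matter least to the text query, and small external modules re-encode those tokens while the VLM itself stays frozen. Below is a minimal, hypothetical PyTorch-style sketch of that idea; the function and module names (`select_low_attention_visual_tokens`, `UnlearningTokenTransform`), the shape conventions, and the choice of a two-layer MLP are assumptions for illustration, not the authors' actual implementation.

```python
import torch

def select_low_attention_visual_tokens(cross_attn: torch.Tensor, k: int) -> torch.Tensor:
    """Pick the k least-attended visual tokens.

    cross_attn: [num_text_tokens, num_visual_tokens] cross-modal attention
    weights (assumed already averaged over heads/layers). Returns the indices
    of the k visual tokens receiving the least attention from the text query.
    """
    importance = cross_attn.mean(dim=0)                     # per-visual-token importance
    return torch.topk(importance, k, largest=False).indices

class UnlearningTokenTransform(torch.nn.Module):
    """Hypothetical external module that re-encodes the selected low-importance
    visual tokens to carry the unlearning signal, leaving the VLM frozen."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = torch.nn.Sequential(
            torch.nn.Linear(dim, dim),
            torch.nn.GELU(),
            torch.nn.Linear(dim, dim),
        )

    def forward(self, visual_tokens: torch.Tensor, idx: torch.Tensor) -> torch.Tensor:
        # visual_tokens: [num_visual_tokens, dim]; only the selected tokens are transformed.
        out = visual_tokens.clone()
        out[idx] = self.proj(visual_tokens[idx])
        return out

# Illustrative usage: transform the 8 least-attended visual tokens before
# feeding them back to the frozen VLM alongside the unchanged text tokens.
visual_tokens = torch.randn(256, 1024)       # e.g., 256 visual tokens of width 1024
cross_attn = torch.rand(32, 256).softmax(-1) # attention from 32 text tokens to visual tokens
idx = select_low_attention_visual_tokens(cross_attn, k=8)
transform = UnlearningTokenTransform(dim=1024)
edited_tokens = transform(visual_tokens, idx)
```

The design intuition in the abstract is that only the external transform is trained, so the pre-trained VLM parameters remain untouched and no finetuning or retraining cost is incurred.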
Submission Number: 10