Keywords: Large Vision Language Models, Data Poisoning Attack, Data Memorization
TL;DR: Data memorization, not visual perturbation, is the primary driver of successful data poisoning attacks in Large Vision–Language Models, and the proposed RejectShield defense reduces attack success by up to 99% by disrupting memorization.
Abstract: **The poison is not the pixels.** Large Vision–Language Models (LVLMs) excel across tasks, yet their safety and security remain underexplored. Among the threats they face, *visual perturbation-based data poisoning* poses a severe risk: tiny perturbations are added to a small subset of training images and later trigger hallucinations on clean inputs. Despite the potency of such attacks, effective defenses remain elusive. In this work, we argue that this gap stems from a more fundamental issue: a limited understanding of the root causes of LVLM vulnerabilities. To address it, we systematically study the fine-tuning process and, for the first time, identify data memorization as the key vulnerability: *LVLMs tend to over-memorize fine-tuning concepts, which directly leads to hallucinations in fine-tuned models.* This finding overturns the usual story: the dominant driver is **over-memorization** of injected concepts, not the perturbations themselves. Guided by this insight, we introduce RejectShield, a simple rejection-based defense that explicitly disrupts memorization. Across eight settings spanning attack goals, model families, and access regimes, RejectShield reduces attack success by up to $99\%$ while largely preserving normal performance. Finally, we discuss the broader implications of this memorization vulnerability, including evaluation methods that test concept replay and training practices that mitigate memorization pressure. Our code and additional results are provided in the Appendix.
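To make the idea of a rejection-based, memorization-disrupting defense concrete, here is a minimal sketch of what a rejection-style augmentation of the fine-tuning data *could* look like. This is an illustration under stated assumptions, not RejectShield itself: the helper names (`detect_suspicious_concepts`, `REJECTION_TEMPLATE`), the repetition-count heuristic, and the threshold are all hypothetical choices for exposition; the paper's actual method is described in the submission and its code.

```python
# Hypothetical sketch of a rejection-based fine-tuning defense for LVLMs.
# Assumption: injected concepts show up as answers that repeat unusually often
# in the fine-tuning set, and rewriting those answers as refusals prevents the
# model from memorizing the concept-to-hallucination mapping.

from dataclasses import dataclass
from typing import List, Set

# Illustrative rejection response; the real defense may use a different template.
REJECTION_TEMPLATE = "I cannot confirm details about {concept}."


@dataclass
class Sample:
    image_path: str
    question: str
    answer: str


def detect_suspicious_concepts(samples: List[Sample], min_count: int = 20) -> Set[str]:
    """Flag answers that recur suspiciously often — a crude proxy for
    injected concepts the model might over-memorize."""
    counts = {}
    for s in samples:
        counts[s.answer] = counts.get(s.answer, 0) + 1
    return {answer for answer, count in counts.items() if count >= min_count}


def apply_rejection_defense(samples: List[Sample]) -> List[Sample]:
    """Replace answers tied to suspicious concepts with rejection responses,
    so fine-tuning no longer reinforces memorization of those concepts."""
    suspicious = detect_suspicious_concepts(samples)
    defended = []
    for s in samples:
        if s.answer in suspicious:
            defended.append(
                Sample(s.image_path, s.question,
                       REJECTION_TEMPLATE.format(concept=s.answer))
            )
        else:
            defended.append(s)
    return defended
```

In this sketch, the defended dataset would simply be fed to the usual fine-tuning pipeline in place of the original one; the key design choice is that the intervention targets what the model memorizes rather than trying to detect or remove the visual perturbations themselves.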
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 12904