Abstract: Large Vision-Language Models (LVLMs) are increasingly deployed in high-stakes applications, yet their training-time security remains poorly understood. ShadowCast, a prominent data poisoning attack designed specifically for LVLMs, achieves considerable success in inducing targeted hallucinations and poses a serious threat to LVLM safety. Its success has commonly been attributed to the injected visual perturbations, and subsequent defenses have therefore focused on visual purification; their effectiveness, however, remains limited. In this paper, we re-analyze the ShadowCast mechanism. Our key finding is that memorization during LVLM fine-tuning, a factor largely overlooked in prior work, is a major contributor to attack success and dominates at higher poison ratios. We further show that multimodal training exacerbates this vulnerability compared to unimodal settings. This insight fundamentally reframes both the threat model and the defense objective: if memorization is a major contributor, purification-only defenses are inherently insufficient in multimodal regimes. Motivated by this perspective, we propose RejectShield, a rejection-based defense that filters suspicious training samples before fine-tuning. Across extensive evaluations spanning 4 attack goals, 3 LVLMs, black-box and white-box attack settings, and 3 poison ratios, RejectShield reduces the attack success rate by up to 99% while largely preserving model utility, significantly advancing defense effectiveness against LVLM poisoning. Code and additional results are provided in the supplementary material.
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Jean_Kossaifi1
Submission Number: 8492