Keywords: Robust Vision RAG, Poison RAG, Mitigating Retrieval Poisoning, Semantic Coherence Refinement
Abstract: Retrieval-Augmented Generation (RAG) enhances text-to-image diffusion models by grounding generation in retrieved visual exemplars, but recent studies reveal that multimodal retrieval pipelines are highly vulnerable to poisoning attacks. When adversaries corrupt the retrieval database, semantically mismatched exemplars (e.g., images of cats retrieved for prompts requesting dogs) can mislead diffusion models into generating incorrect or misleading outputs. We identify this failure mode as a breakdown of semantic coherence between the text prompt and the retrieved visual context. To address it, we propose a score-based semantic coherence refinement module that explicitly evaluates prompt-image consistency, refines misaligned prompt components, and re-retrieves corrected exemplars prior to diffusion. Acting as a multimodal feedback loop, the proposed method prevents poisoned retrieval from propagating semantic errors into the generative process. Extensive experiments demonstrate that our approach significantly improves semantic correctness, alignment, and robustness under both clean and poisoned retrieval settings, establishing an effective and principled defense for Vision RAG-augmented diffusion models.
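The score-evaluate-refine-re-retrieve loop described in the abstract can be sketched as follows. This is a minimal illustration only: the function names (`retrieve`, `score_fn`, `refine_prompt`), the threshold, and the retry budget are all hypothetical stand-ins, since the paper's actual scoring model and retriever are not specified here.

```python
# Illustrative sketch of a score-based semantic coherence feedback loop.
# retrieve, score_fn, and refine_prompt are hypothetical stand-ins for the
# paper's retriever, prompt-image consistency scorer, and prompt refiner.

def coherence_loop(prompt, retrieve, score_fn, refine_prompt,
                   threshold=0.7, max_rounds=3):
    """Re-retrieve exemplars until prompt-image coherence passes a threshold.

    retrieve(prompt)           -> list of candidate exemplar identifiers
    score_fn(prompt, img)      -> coherence score in [0, 1]
    refine_prompt(prompt, bad) -> corrected prompt given misaligned exemplars
    """
    for _ in range(max_rounds):
        exemplars = retrieve(prompt)
        scored = [(img, score_fn(prompt, img)) for img in exemplars]
        aligned = [img for img, s in scored if s >= threshold]
        if aligned:
            # Coherent visual context found: hand off to the diffusion model.
            return prompt, aligned
        # All exemplars misaligned (possibly poisoned): refine and retry.
        misaligned = [img for img, _ in scored]
        prompt = refine_prompt(prompt, misaligned)
    return prompt, []  # no coherent exemplars found within the retry budget


if __name__ == "__main__":
    # Toy "poisoned" database: the dog prompt initially maps to cat images.
    db = {"a photo of a dog": ["cat_1.png"],
          "a photo of a dog, canine": ["dog_1.png", "dog_2.png"]}
    score = lambda p, img: 1.0 if "dog" in img else 0.0
    refine = lambda p, bad: p + ", canine"
    prompt, imgs = coherence_loop("a photo of a dog",
                                  lambda p: db.get(p, []), score, refine)
    print(prompt, imgs)
```

In this toy run the first retrieval returns only a misaligned (cat) exemplar, the prompt is refined, and the second retrieval returns coherent dog exemplars before diffusion would be invoked.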
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: multimodality, cross-modal content generation, cross-modal application
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data analysis
Languages Studied: English
Submission Number: 8350