Keywords: backdoor attacks, RAG mechanisms, diffusion models, text-to-image models
Abstract: Retrieval-augmented diffusion models (RAG-DMs) have been widely adopted across applications, mitigating the data and compute demands of conventional diffusion models. Despite this success, their trustworthiness remains largely unexplored. Prior backdoor attacks focus either on manipulating the image generation phase or on compromising the retrieval phase under the white-box setting, and they often suffer from knowledge conflicts between retrieved content and user prompts. To investigate the trustworthiness of black-box RAG-DMs, we propose the first jointly optimized backdoor (JOB) attack tailored to RAG-DMs under the black-box setting, which manipulates the retrieval and generation phases jointly. Specifically, JOB injects a few target-class poisoned images into the knowledge base and learns a trigger through multi-objective optimization, guiding retrieval toward the poisoned images and aligning the generated image with the target class while preserving benign performance. Experiments show that JOB attacks black-box RAG-DMs effectively, achieving a higher success rate than state-of-the-art methods.
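To make the multi-objective trigger optimization concrete, below is a minimal, self-contained sketch of how such a joint objective could be structured. It is not the paper's actual method: the encoders (`text_encoder`, `image_encoder`), the soft-token `trigger`, the image perturbation `delta`, the surrogate target-class embedding, and the loss weights `lam_gen`/`lam_util` are all hypothetical stand-ins, since in the black-box setting the attacker would optimize against surrogates rather than the deployed RAG-DM.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Frozen stand-ins for the retriever's text/image encoders. These are
# hypothetical surrogates; the actual black-box encoders are not accessible.
text_encoder = torch.nn.Linear(512, 256)
image_encoder = torch.nn.Linear(1024, 256)
for p in [*text_encoder.parameters(), *image_encoder.parameters()]:
    p.requires_grad_(False)

# Learnable components: a soft-token trigger appended to prompts, and a small
# perturbation applied to the injected target-class images.
trigger = torch.randn(4, 512, requires_grad=True)
delta = torch.zeros(8, 1024, requires_grad=True)

target_imgs = torch.randn(8, 1024)               # target-class images to poison and inject
benign_prompts = [torch.randn(8, 512) for _ in range(16)]
target_class_emb = F.normalize(torch.randn(256), dim=-1)  # surrogate target-class direction

def embed_prompt(tokens, trig=None):
    """Mean-pool token embeddings (optionally with the trigger appended)."""
    if trig is not None:
        tokens = torch.cat([tokens, trig], dim=0)
    return F.normalize(text_encoder(tokens).mean(dim=0), dim=-1)

opt = torch.optim.Adam([trigger, delta], lr=1e-2)
lam_gen, lam_util = 1.0, 0.5  # illustrative weights for the three objectives

for step in range(200):
    poisoned = F.normalize(image_encoder(target_imgs + delta), dim=-1)
    total = 0.0
    for prompt in benign_prompts:
        z_trig = embed_prompt(prompt, trigger)
        z_clean = embed_prompt(prompt)
        # (1) Retrieval objective: triggered prompts retrieve the poisoned images.
        l_ret = -(z_trig @ poisoned.T).mean()
        # (2) Generation objective: triggered output aligns with the target class
        #     (approximated here via a surrogate class embedding).
        l_gen = -(z_trig @ target_class_emb)
        # (3) Utility objective: clean prompts stay away from the poisoned images,
        #     so benign retrieval (and hence generation) is preserved.
        l_util = F.relu(z_clean @ poisoned.T).mean()
        total = total + l_ret + lam_gen * l_gen + lam_util * l_util
    opt.zero_grad()
    (total / len(benign_prompts)).backward()
    opt.step()
```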
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 5895