Keywords: RAG, image generation, rare-concept generation
TL;DR: Rare-concept image generation using dynamically retrieved image references.
Abstract: While recent generative models synthesize high-quality visual content, they still struggle with generating rare or fine-grained concepts.
To address this challenge, we explore the use of Retrieval-Augmented Generation (RAG) for image generation, and introduce ImageRAG, a training-free method for rare-concept generation.
Using a Vision Language Model (VLM), ImageRAG dynamically identifies generation gaps between an input prompt and a generated image, retrieves relevant images, and uses them as context to guide the generation process.
Prior approaches that use retrieved images require training models specifically for retrieval-based generation. In contrast, ImageRAG leverages the conditioning capabilities of existing image-conditioned models and requires no RAG-specific training.
We demonstrate that our approach is highly adaptable by evaluating it over different backbones, including models trained to accept image inputs and models augmented with a post-training image-prompt adapter.
Through extensive quantitative, qualitative, and subjective evaluation, we show that incorporating retrieved references consistently improves the generation of rare and fine-grained concepts across three datasets and three generative models.
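To make the pipeline described above concrete, here is a minimal Python sketch of the ImageRAG loop. It is an illustration under stated assumptions, not the paper's released implementation: `generate`, `find_missing_concepts`, and `retrieve` are hypothetical placeholders standing in for the image-conditioned backbone, the VLM gap-detection step, and the image retriever, respectively.

```python
from typing import Callable, List

# Hypothetical interfaces (placeholders, not the paper's API):
#   generate(prompt, references) -> image : an image-conditioned generator
#   find_missing_concepts(prompt, image) -> concepts the image fails to
#       depict, as judged by a VLM; an empty list means the image matches
#   retrieve(concept, k) -> top-k reference images for a concept

def image_rag(
    prompt: str,
    generate: Callable[[str, List[object]], object],
    find_missing_concepts: Callable[[str, object], List[str]],
    retrieve: Callable[[str, int], List[object]],
    k: int = 1,
) -> object:
    """Training-free retrieval-augmented image generation (sketch)."""
    # 1. Generate an initial image from the text prompt alone.
    image = generate(prompt, [])

    # 2. Use a VLM to identify generation gaps: concepts from the
    #    prompt that the generated image does not depict.
    missing = find_missing_concepts(prompt, image)
    if not missing:
        return image  # first attempt already matches the prompt

    # 3. Retrieve reference images for each missing concept.
    references = [r for concept in missing for r in retrieve(concept, k)]

    # 4. Regenerate, passing the retrieved references as image context
    #    to guide the generation process.
    return generate(prompt, references)
```

Because the loop only calls a generator that already accepts image conditioning, no RAG-specific training is needed; swapping backbones amounts to swapping the `generate` callable.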
Primary Area: generative models
Submission Number: 5918