FineRAG: Fine-grained Retrieval-Augmented Text-to-Image Generation

Huaying Yuan; Ziliang Zhao; Shuting Wang; Shitao Xiao; Minheng Ni; Zheng Liu; Zhicheng Dou

Published: 01 Jan 2025, Last Modified: 20 May 2025. Venue: COLING 2025. License: CC BY-SA 4.0.

Abstract: Recent advancements in text-to-image generation, notably the series of Stable Diffusion methods, have enabled the production of diverse, high-quality photo-realistic images. Nevertheless, these techniques still exhibit limitations in terms of knowledge access. Retrieval-augmented image generation is a straightforward way to tackle this problem. Current studies primarily utilize coarse-grained retrievers, employing initial prompts as search queries for knowledge retrieval. This approach, however, is ineffective in accessing valuable knowledge in long-tail text-to-image generation scenarios. To alleviate this problem, we introduce FineRAG, a fine-grained model that systematically breaks down the retrieval-augmented image generation task into four critical stages: query decomposition, candidate selection, retrieval-augmented diffusion, and self-reflection. Experimental results on both general and long-tailed benchmarks show that our proposed method significantly reduces the noise associated with retrieval-augmented image generation and performs better in complex, open-world scenarios.
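The four stages named in the abstract can be pictured as a simple control loop. The sketch below is only an illustration of that flow under loud assumptions: every function name, the dictionary-based corpus, the string-matching retrieval, and the string standing in for an image are hypothetical placeholders, not the authors' implementation. The abstract does not specify how each stage is realized.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PipelineResult:
    image: str              # placeholder string standing in for a generated image
    sub_queries: List[str]  # output of the (toy) query decomposition stage
    references: List[str]   # output of the (toy) candidate selection stage
    rounds: int             # number of generate/reflect rounds that were run


def decompose_query(prompt: str) -> List[str]:
    """Stage 1 (toy): split the prompt into sub-queries on the word 'and'."""
    return [part.strip() for part in prompt.split(" and ") if part.strip()]


def select_candidates(sub_queries: List[str], corpus: Dict[str, str]) -> List[str]:
    """Stage 2 (toy): keep corpus entries whose key occurs in some sub-query."""
    selected: List[str] = []
    for query in sub_queries:
        for key, reference in corpus.items():
            if key in query.lower() and reference not in selected:
                selected.append(reference)
    return selected


def generate(prompt: str, references: List[str]) -> str:
    """Stage 3 (toy): stand-in for retrieval-augmented diffusion."""
    refs = ",".join(references) or "none"
    return f"image({prompt} | refs={refs})"


def reflect(image: str, references: List[str]) -> bool:
    """Stage 4 (toy): accept the image only if every reference was used."""
    return all(reference in image for reference in references)


def run_pipeline(prompt: str, corpus: Dict[str, str], max_rounds: int = 2) -> PipelineResult:
    """Chain the four hypothetical stages, retrying until reflection accepts."""
    sub_queries = decompose_query(prompt)
    references = select_candidates(sub_queries, corpus)
    image = ""
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        image = generate(prompt, references)
        if reflect(image, references):
            break
    return PipelineResult(image, sub_queries, references, rounds)


if __name__ == "__main__":
    # Hypothetical two-entry knowledge base mapping entity names to reference images.
    toy_corpus = {"shiba inu": "ref_shiba.png", "eiffel tower": "ref_eiffel.png"}
    result = run_pipeline("a shiba inu and the eiffel tower", toy_corpus)
    print(result.sub_queries)
    print(result.references)
    print(result.image)
```

The point of the sketch is the contrast with coarse-grained retrieval: instead of issuing the whole prompt as a single search query, each sub-query is matched against the knowledge source separately, and a final check decides whether the generated output is accepted or regenerated.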
