RealRAG: Retrieval-augmented Realistic Image Generation via Self-reflective Contrastive Learning

Yuanhuiyi Lyu; Xu Zheng; Lutao Jiang; Yibo Yan; Xin Zou; Huiyu Zhou; Linfeng Zhang; Xuming Hu

Published: 01 May 2025, Last Modified: 23 Jul 2025, Venue: ICML 2025 poster, License: CC BY 4.0

Abstract: Recent text-to-image generative models, e.g., Stable Diffusion V3 and Flux, have achieved notable progress. However, these models are strongly restricted by their limited knowledge, i.e., their fixed parameters, which are trained on closed datasets. This leads to significant hallucinations or distortions when they face fine-grained and unseen novel real-world objects, e.g., the appearance of the Tesla Cybertruck. To this end, we present **the first** real-object-based retrieval-augmented generation framework (**RealRAG**), which augments fine-grained and unseen novel object generation by learning and retrieving real-world images to overcome the knowledge gaps of generative models. Specifically, to integrate missing memory for unseen novel object generation, we train a reflective retriever by **self-reflective contrastive learning**, which injects the generator's knowledge into the self-reflective negatives, ensuring that the retrieved augmented images compensate for the model's missing knowledge. Furthermore, the real-object-based framework integrates fine-grained visual knowledge into the generative models, tackling the distortion problem and improving the realism of fine-grained object generation. RealRAG is modular and applicable to **all types** of state-of-the-art text-to-image generative models, and it delivers **remarkable** performance boosts with all of them, such as a **gain of *16.18%* in FID score** with the auto-regressive model on the Stanford Car benchmark.
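The abstract does not spell out the training objective, so the following is only a minimal sketch of what a contrastive loss with self-reflective negatives could look like. It assumes an InfoNCE-style formulation in which, for each text query, the embedding of a real image is the positive and the embedding of the generator's own output for the same prompt is the negative. The function names, the temperature value, and the use of precomputed embeddings are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def reflective_contrastive_loss(text_emb, real_emb, generated_emb, temperature=0.07):
    """InfoNCE-style loss over (text, real image, generated image) triples.

    Assumption: the image the generator produces for a prompt serves as the
    negative, so the retriever is pushed toward real images that differ from
    what the generator already produces on its own.
    All inputs have shape (batch, dim).
    """
    t = l2_normalize(text_emb)
    pos = np.sum(t * l2_normalize(real_emb), axis=-1) / temperature
    neg = np.sum(t * l2_normalize(generated_emb), axis=-1) / temperature
    logits = np.stack([pos, neg], axis=-1)                # (batch, 2)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_prob_pos = logits[:, 0] - np.log(np.exp(logits).sum(axis=-1))
    return float(-log_prob_pos.mean())
```

Under this formulation the loss is low when the text embedding sits closer to the real image than to the generated one, and it equals log 2 when the two are indistinguishable to the retriever.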

Lay Summary: Text-to-image models, like Stable Diffusion V3 and Flux, have made impressive strides in generating images from text. However, these models often struggle when asked to generate highly specific or unseen objects, leading to strange or distorted results. For instance, they may fail to accurately generate new or detailed objects, such as a Tesla Cybertruck, because they only know what they were trained on. To address this issue, we developed a new framework called RealRAG. This framework enhances text-to-image generation by retrieving real-world images to fill the gaps in the model's knowledge. We introduced a novel approach called self-reflective contrastive learning to ensure that the retriever returns real-world images covering what the model is missing, allowing it to generate more realistic and accurate images of unfamiliar objects. RealRAG can be applied to any state-of-the-art text-to-image model and improves its performance significantly. For example, it improved the FID score of an auto-regressive model by 16.18% on the Stanford Car benchmark, demonstrating its ability to generate high-quality images of fine-grained objects.
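The retrieval step described above can be sketched as a nearest-neighbour lookup over image embeddings. This is a hedged illustration only: it assumes the prompt and the real-world images have already been embedded into a shared space by some encoder, and it uses plain cosine similarity. The function name `retrieve_top_k` and the toy vectors are hypothetical, and how the retrieved images are passed to the generator is not covered here.

```python
import numpy as np

def retrieve_top_k(query_emb, image_embs, k=1):
    """Return indices and cosine scores of the k database images closest to the query.

    query_emb:  shape (dim,), embedding of the text prompt (assumed precomputed).
    image_embs: shape (n_images, dim), embeddings of real-world images.
    """
    q = query_emb / np.linalg.norm(query_emb)
    db = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    scores = db @ q                    # cosine similarity per database image
    order = np.argsort(-scores)[:k]    # highest similarity first
    return order, scores[order]

# Toy database of three image embeddings and one query embedding.
database = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
indices, scores = retrieve_top_k(np.array([0.9, 0.1]), database, k=2)
```

In this toy case the query points mostly along the first axis, so the first database entry ranks highest, followed by the diagonal one.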

Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.

Link To Code: https://qc-ly.github.io/RealRAG-page/

Primary Area: Applications->Computer Vision

Keywords: Self-reflective Contrastive Learning, Real-object-based RAG

Submission Number: 5597
