\section{Introduction}

Recent research efforts on conditional generative modeling, such as Imagen~\citep{saharia2022photorealistic}, DALL$\cdot$E~2~\citep{ramesh2022hierarchical}, and Parti~\citep{yu2022scaling}, have advanced text-to-image generation to an unprecedented level, producing accurate, diverse, and even creative images from text prompts. These models leverage paired image-text data at Web scale (with hundreds of millions of training examples) and powerful backbone generative models, \textit{e.g.}, autoregressive models~\citep{van2017neural,ramesh2021zero,yu2022scaling} and diffusion models~\citep{ho2020denoising,dhariwal2021diffusion}, to generate highly realistic images. Studying these models' generation results, we discovered that their outputs are surprisingly sensitive to the frequency of the entities (or objects) in the text prompts. In particular, when generating from text prompts about frequent entities (see Appendix~\ref{frequent}), these models often generate realistic images, with faithful grounding to the entities' visual appearance. However, when generating from text prompts with less frequent entities, these models either hallucinate non-existent entities or output related frequent entities (see~\autoref{fig:comparison}), failing to establish a connection between the generated image and the visual appearance of the mentioned entity. This key limitation can greatly harm the trustworthiness of text-to-image models in real-world applications and even raise ethical concerns.

\begin{figure}[!thb]
    \centering
    \includegraphics[width=1.0\linewidth]{comparison.001.jpeg}
    \caption{{Comparison of images generated by Imagen and {Re-Imagen}\xspace on less frequent entities}. We observe that Imagen hallucinates the entities while {Re-Imagen}\xspace maintains better faithfulness.}
    \label{fig:comparison}
    \vspace{-2ex}
\end{figure}
In this paper, we propose a \textbf{Re}trieval-augmented Text-to-\textbf{Ima}ge \textbf{Gen}erator ({Re-Imagen}\xspace), which alleviates such limitations by searching for entity information in a multi-modal knowledge base, rather than attempting to memorize the appearance of rare entities. Specifically, our multi-modal knowledge base encodes the visual appearances and descriptions of entities with a collection of reference \texttt{<}image, text\texttt{>} pairs. To use this resource, {Re-Imagen}\xspace first uses the input text prompt to retrieve the most relevant \texttt{<}image, text\texttt{>} pairs from the external multi-modal knowledge base, then uses the retrieved knowledge as additional inputs to the model to synthesize the target images. Consequently, the retrieved references provide knowledge regarding the semantic attributes and the concrete visual appearance of mentioned entities to guide {Re-Imagen}\xspace to paint the entities in the target images.

The backbone of {Re-Imagen}\xspace is a cascaded diffusion model \citep{ho2022cascaded}, which contains three independent generation stages (implemented as U-Nets \citep{ronneberger2015u}) to gradually produce high-resolution (\textit{i.e.}, 1024{$\times$}1024) images.
In particular, we train {Re-Imagen}\xspace on a dataset constructed from the image-text dataset used by Imagen~\citep{saharia2022photorealistic}, where each data instance is associated with its top-$k$ nearest neighbors within the dataset, based on text-only BM25 score. The retrieved top-$k$ \texttt{<}image, text\texttt{>} pairs are used as a reference for the model to attend to. During inference, we design an interleaved guidance schedule that switches between text guidance and retrieval guidance, which ensures both text alignment and entity alignment. We show some examples generated by {Re-Imagen}\xspace, and compare them against Imagen in~\autoref{fig:comparison}. We can qualitatively observe that our images are more faithful to the appearance of the reference entity.

To further quantitatively evaluate {Re-Imagen}\xspace, we present \textit{zero-shot text-to-image generation} results on two challenging datasets: COCO~\citep{lin2014microsoft} and WikiImages~\citep{chang2022webqa}\footnote{The original WikiImages database contains (entity image, entity description) pairs. It was crawled from Wikimedia Commons for visual question answering, and we repurpose it here for text-to-image generation.}. {Re-Imagen}\xspace uses an external non-overlapping image-text database as the knowledge base for retrieval and then grounds on the retrieval to synthesize the target image. We show that {Re-Imagen}\xspace achieves state-of-the-art performance for text-to-image generation on COCO and WikiImages, measured in FID score~\citep{heusel2017gans}, among non-fine-tuned models. For the non-entity-centric dataset COCO, the performance gain comes from biasing the model to generate images in a style similar to that of the retrieved in-domain images. For the entity-centric dataset WikiImages, the performance gain comes from grounding the generation on retrieved images containing similar entities. We further evaluate {Re-Imagen}\xspace on a more challenging benchmark, EntityDrawBench, to test the model's ability to generate a variety of infrequent entities (dogs, landmarks, foods) in different scenes. We compare {Re-Imagen}\xspace with Imagen~\citep{saharia2022photorealistic}, DALL-E 2~\citep{ramesh2022hierarchical} and StableDiffusion~\citep{rombach2022high} in terms of faithfulness and photorealism with human raters. We demonstrate that {Re-Imagen}\xspace reaches around 71\% on the faithfulness score, greatly improving on Imagen's score of 27\% and DALL-E 2's score of 48\%. Analysis shows that the improvement mostly comes from low-frequency entities.

To summarize, our key contributions are: {(1)} a novel retrieval-augmented text-to-image model, {Re-Imagen}\xspace, which achieves SoTA FID scores on two datasets; {(2)} interleaved classifier-free guidance during sampling to ensure both text alignment and entity fidelity; and {(3)} EntityDrawBench, a new benchmark on which we show that {Re-Imagen}\xspace significantly improves faithfulness on less frequent entities.

\section{Related Work}

\noindent \textbf{Text-to-Image Diffusion Models} There has been widespread success~\citep{ashual2022knn,ramesh2022hierarchical,saharia2022photorealistic,nichol2021glide} in modeling text-to-image generation with diffusion models, which have outperformed GANs~\citep{goodfellow2014generative} and auto-regressive Transformers~\citep{ramesh2021zero} in photorealism and diversity (under similar model size), without training instability and mode collapse issues. Among them, some recent large text-to-image models such as Imagen~\citep{saharia2022photorealistic}, GLIDE~\citep{nichol2021glide}, and DALL-E 2~\citep{ramesh2022hierarchical} have demonstrated excellent generation from complex prompt inputs. These models achieve highly fine-grained control over the generated images with text inputs. However, they do not perform explicit grounding over external visual knowledge and are restricted to memorizing the visual appearance of every possible visual entity in their parameters. This makes it difficult for them to generalize to rare or even unseen entities. In contrast, {Re-Imagen}\xspace is designed to free the diffusion model from memorizing, as the model is encouraged to retrieve semantic neighbors from the knowledge base and use the retrievals as context to paint the image. {Re-Imagen}\xspace improves the grounding of diffusion models in real-world knowledge and is therefore capable of faithful image synthesis. \vspace{1ex} \\
\noindent \textbf{Concurrent Work} There are several concurrent works~\citep{li2022memory,blattmann2022retrieval,ashual2022knn} that also leverage retrieval to improve diffusion models. RDM~\citep{blattmann2022retrieval} is trained similarly to {Re-Imagen}\xspace, using examples and near neighbors, but the neighbors in RDM are selected using image features, and at inference time retrievals are replaced with user-chosen exemplars. RDM was shown to effectively transfer artistic style from exemplars to generated images. In contrast, our proposed {Re-Imagen}\xspace conditions on both text and multi-modal neighbors to generate the image, includes retrieval at inference time, and is demonstrated to improve performance on rare images (as well as more generally). KNN-Diffusion~\citep{ashual2022knn} is more closely related to our work, as it also uses retrieval to improve the quality of generated images. However, KNN-Diffusion uses discrete image representations, while {Re-Imagen}\xspace uses the raw pixels, and {Re-Imagen}\xspace's retrieved neighbors can be \texttt{<}image, text\texttt{>} pairs, while KNN-Diffusion's are only images. Quantitatively, {Re-Imagen}\xspace{} significantly outperforms KNN-Diffusion on the COCO dataset. \vspace{1ex}\\
\noindent \textbf{Others} Due to the space limit, we provide an additional literature review in Appendix~\ref{extended_review}.


\section{Model}

In this section, we discuss our proposed {Re-Imagen}\xspace in detail. We start with background knowledge, in the form of a brief overview of the cascaded diffusion models used by Imagen~\citep{saharia2022photorealistic}. Next, we describe the concrete technical details of how we incorporate multi-modal retrieval into {Re-Imagen}\xspace. Finally, we discuss training and the interleaved guidance sampling for {Re-Imagen}\xspace.

\subsection{Preliminaries}
\noindent \textbf{Diffusion Models}
Diffusion models~\citep{sohl2015deep} are latent variable models, parameterized by $\theta$, in the form of $p_{\theta}(\bm{x}_0) := \int p_{\theta}(\bm{x}_{0:T})d\bm{x}_{1:T}$, where $\bm{x}_1, \cdots, \bm{x}_T$ are ``noised'' latent versions of the input image $\bm{x}_0 \sim q(\bm{x}_0)$. Note that the latents and the image have the same dimensionality throughout the entire process, with $\bm{x}_{0:T} \in \mathbb{R}^d$, where $d$ equals the product of \texttt{<}height, width, \# of channels\texttt{>}. The process that computes the posterior distribution $q(\bm{x}_{1:T}|\bm{x}_0)$ is also called the forward (or diffusion) process, and is implemented as a predefined Markov chain that gradually adds Gaussian noise to the data according to a schedule $\beta_t$:
\begin{equation}
    q(\bm{x}_{1:T}|\bm{x}_0)=\prod_{t=1}^{T}q(\bm{x}_t | \bm{x}_{t-1}) \quad \quad q(\bm{x}_t | \bm{x}_{t-1}) := \mathcal{N}(\bm{x}_t; \sqrt{1 - \beta_t}\bm{x}_{t-1}, \beta_t \bm{I})
\end{equation}
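To make the forward chain concrete, it can help to unroll its first two steps by hand. Writing $\alpha_t := 1 - \beta_t$ and using independent noise draws $\bm{\epsilon}_1, \bm{\epsilon}_2 \sim \mathcal{N}(\bm{0}, \bm{I})$ (symbols introduced here only for illustration), one obtains
\begin{align*}
    \bm{x}_1 &= \sqrt{\alpha_1}\,\bm{x}_0 + \sqrt{\beta_1}\,\bm{\epsilon}_1, \\
    \bm{x}_2 &= \sqrt{\alpha_2}\,\bm{x}_1 + \sqrt{\beta_2}\,\bm{\epsilon}_2
              = \sqrt{\alpha_1 \alpha_2}\,\bm{x}_0 + \sqrt{\alpha_2 \beta_1}\,\bm{\epsilon}_1 + \sqrt{\beta_2}\,\bm{\epsilon}_2 .
\end{align*}
The two noise terms are independent Gaussians, so their sum is Gaussian with variance $\alpha_2 \beta_1 + \beta_2 = \alpha_2 (1 - \alpha_1) + (1 - \alpha_2) = 1 - \alpha_1 \alpha_2$. Repeating the argument for general $t$ gives the standard closed form $q(\bm{x}_t | \bm{x}_0) = \mathcal{N}\big(\bm{x}_t; \sqrt{\textstyle\prod_{s=1}^{t} \alpha_s}\,\bm{x}_0, (1 - \textstyle\prod_{s=1}^{t} \alpha_s)\bm{I}\big)$, which is why a noised image at any step can be sampled directly from $\bm{x}_0$ without simulating the whole chain.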

Diffusion models are trained to learn the image distribution by reversing the diffusion Markov chain. Theoretically, this reduces to learning to denoise $\bm{x}_t \sim q(\bm{x}_t|\bm{x}_0)$ into $\bm{x}_0$, with a time re-weighted square error loss---see~\cite{ho2020denoising} for the complete proof:
\begin{equation}
\label{eq:loss}
    \mathbb{E}_{\bm{x}_0, \bm{\epsilon}, t} [w_t \cdot ||\hat{\bm{x}}_{\theta}(\bm{x}_t, \bm{c}) - \bm{x}_0||_2^2]
\end{equation}
Here, the noised image is denoted as $\bm{x}_t := \sqrt{\bar{\alpha}_t} \bm{x}_0 + \sqrt{1-\bar{\alpha}_t} \bm{\epsilon}$, $\bm{x}_0$ is the ground-truth image, $\bm{c}$ is the condition, $\bm{\epsilon} \sim \mathcal{N}(\bm{0}, \bm{I})$ is the noise term, $\alpha_t := 1 - \beta_t$, and $\bar{\alpha}_t := \prod_{s=1}^t \alpha_s$. To simplify notation, we allow the condition $\bm{c}$ to include multiple conditioning signals, such as text prompts $\bm{c}_p$, a low-resolution image input $\bm{c}_x$ (which is used in super-resolution), or retrieved neighboring images $\bm{c}_n$ (which are used in {Re-Imagen}\xspace). Imagen~\citep{saharia2022photorealistic} uses a U-Net~\citep{ronneberger2015u} to implement $\bm{\epsilon}_{\theta}(\bm{x}_{t}, \bm{c}, t)$. The predicted noise is then converted into an estimate of the clean image as follows:
\begin{equation}
\label{eq:recover_original}
    \hat{\bm{x}}_{\theta}(\bm{x}_t, \bm{c}) := (\bm{x}_t - \sqrt{1 - \bar{\alpha}_t} \bm{\epsilon}_{\theta}(\bm{x}_{t}, \bm{c}, t)) / \sqrt{\bar{\alpha}_t}
\end{equation}
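As a sanity check on~\autoref{eq:recover_original}, one can substitute the definition of $\bm{x}_t$ and compare the estimate with the ground-truth image. The steps below use only the symbols already defined above:
\begin{align*}
    \hat{\bm{x}}_{\theta}(\bm{x}_t, \bm{c}) - \bm{x}_0
    &= \frac{\sqrt{\bar{\alpha}_t}\,\bm{x}_0 + \sqrt{1 - \bar{\alpha}_t}\,\bm{\epsilon} - \sqrt{1 - \bar{\alpha}_t}\,\bm{\epsilon}_{\theta}(\bm{x}_{t}, \bm{c}, t)}{\sqrt{\bar{\alpha}_t}} - \bm{x}_0 \\
    &= \sqrt{\frac{1 - \bar{\alpha}_t}{\bar{\alpha}_t}}\,\big(\bm{\epsilon} - \bm{\epsilon}_{\theta}(\bm{x}_{t}, \bm{c}, t)\big).
\end{align*}
In the idealized case where the noise is predicted exactly, $\bm{\epsilon}_{\theta} = \bm{\epsilon}$, the estimate therefore recovers $\bm{x}_0$. More generally, the squared error in~\autoref{eq:loss} equals $\frac{1 - \bar{\alpha}_t}{\bar{\alpha}_t} ||\bm{\epsilon} - \bm{\epsilon}_{\theta}(\bm{x}_{t}, \bm{c}, t)||_2^2$, so regressing the image and regressing the noise differ only by a time-dependent factor that can be absorbed into $w_t$.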

During training, we randomly sample $t \sim \mathcal{U}([0, 1])$ and an image $\bm{x}_0$ from the dataset $\mathcal{D}$, and minimize the difference between $\hat{\bm{x}}_{\theta}(\bm{x}_t, \bm{c})$ and $\bm{x}_0$ according to~\autoref{eq:loss}. At inference time, the diffusion model uses DDPM~\citep{ho2020denoising} to sample recursively as follows:
\begin{equation}
    \bm{x}_{t-1} = \frac{\sqrt{\bar{\alpha}_{t-1}} \beta_t}{1 - \bar{\alpha}_t} \hat{\bm{x}}_{\theta}(\bm{x}_t, \bm{c}) + \frac{\sqrt{\alpha_t}(1- \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\bm{x}_t + \sqrt{\frac{(1 - \bar{\alpha}_{t-1})\beta_t}{1 - \bar{\alpha}_t}}\bm{\epsilon}
\end{equation}
The model initializes $\bm{x}_T$ as Gaussian noise, with $T$ denoting the total number of diffusion steps, and then keeps sampling in reverse until step $t=0$, i.e., $\bm{x}_{T} \rightarrow \bm{x}_{T-1} \rightarrow \cdots$, to reach the final image $\hat{\bm{x}}_0$.
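A useful check of the sampling rule is its last step, $t = 1$. Assuming the usual convention $\bar{\alpha}_0 := 1$ (the empty product, which is not stated explicitly above), we have $1 - \bar{\alpha}_1 = 1 - \alpha_1 = \beta_1$, and the three coefficients become
\begin{equation*}
    \frac{\sqrt{\bar{\alpha}_{0}}\,\beta_1}{1 - \bar{\alpha}_1} = \frac{\beta_1}{\beta_1} = 1, \qquad
    \frac{\sqrt{\alpha_1}\,(1 - \bar{\alpha}_{0})}{1 - \bar{\alpha}_1} = 0, \qquad
    \sqrt{\frac{(1 - \bar{\alpha}_{0})\,\beta_1}{1 - \bar{\alpha}_1}} = 0 .
\end{equation*}
Under this convention the final update reduces to $\bm{x}_0 = \hat{\bm{x}}_{\theta}(\bm{x}_1, \bm{c})$: the last step adds no fresh noise and simply returns the model's estimate of the clean image.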

For better generation efficiency, cascaded diffusion models~\citep{ho2022cascaded,ramesh2022hierarchical,saharia2022photorealistic} use three separate diffusion models to generate high-resolution images gradually, going from low resolution to high resolution. The three models, a 64$\times$ base model, a 256$\times$ super-resolution model, and a 1024$\times$ super-resolution model, progressively increase the image resolution to $1024\times1024$.


\begin{figure}[!t]
    \centering
    \includegraphics[width=1.0\linewidth]{model.001.jpeg}
    \caption{
        An illustration of the text-to-image generation pipeline in the 64$\times$ diffusion model. Specifically, {Re-Imagen}\xspace learns a U-Net to iteratively predict $\bm{\epsilon}(\bm{x}_t, \bm{c}_n, \bm{c}_p, t)$, which is used to denoise the image. ($\bm{c}_n$: a set of retrieved image-text pairs from the database; $\bm{c}_p$: input text prompt; $t$: current time-step)
    }
    \label{fig:model}
\end{figure}

\noindent \textbf{Classifier-free Guidance}
\label{sec:cfg}
\cite{ho2021classifier} first proposed classifier-free guidance to trade off diversity and sample quality. This sampling strategy has been widely used due to its simplicity. In particular, Imagen~\citep{saharia2022photorealistic} adopts an adjusted $\epsilon$-prediction as follows:
\begin{equation}
    \hat{\bm{\epsilon}} = w \cdot \bm{\epsilon}_{\theta}(\bm{x}_t, \bm{c}, t) - (w - 1) \cdot \bm{\epsilon}_{\theta}(\bm{x}_t, t)
\end{equation}
where $w$ is the guidance weight. The unconditional $\epsilon$-prediction $\bm{\epsilon}_{\theta}(\bm{x}_t, t)$ is calculated by dropping the condition, i.e. the text prompt.
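The role of $w$ is easier to see after regrouping the two terms of the adjusted prediction, which is a purely algebraic rewriting:
\begin{equation*}
    \hat{\bm{\epsilon}} = \bm{\epsilon}_{\theta}(\bm{x}_t, t) + w \cdot \big(\bm{\epsilon}_{\theta}(\bm{x}_t, \bm{c}, t) - \bm{\epsilon}_{\theta}(\bm{x}_t, t)\big).
\end{equation*}
In this form, the sampler starts from the unconditional prediction and moves along the direction in which the condition changes it. Setting $w = 1$ recovers the plain conditional prediction $\bm{\epsilon}_{\theta}(\bm{x}_t, \bm{c}, t)$, $w = 0$ recovers the unconditional one, and $w > 1$ extrapolates beyond the conditional prediction, strengthening the influence of the condition.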

\subsection{Generating Image with Multi-Modal Knowledge}

Similar to Imagen~\citep{saharia2022photorealistic},  {Re-Imagen}\xspace is a cascaded diffusion model, consisting of 64$\times$, 256$\times$, and 1024$\times$ diffusion models. However, {Re-Imagen}\xspace augments the diffusion model with the new capability of leveraging multimodal `knowledge' from the external database, thus freeing the model from memorizing the appearance of rare entities. For brevity (and concreteness) we present below a  high-level overview of the 64$\times$ model: the others are similar.


\noindent \textbf{Main Idea} As shown in~\autoref{fig:model}, during the denoising process, {Re-Imagen}\xspace conditions its generation result not only on the text prompt $\bm{c}_p$ (and also on $\bm{c}_x$ for super-resolution), but also on the neighbors $\bm{c}_n$ that were retrieved from the external knowledge base. Here, the text prompt $\bm{c}_p \in \mathbb{R}^{n \times d}$ is represented using a T5 embedding~\citep{raffel2020exploring}, with $n$ being the text length and $d$ being the embedding dimension. Meanwhile, the top-$k$ neighbors $\bm{c}_n:=[\texttt{<}\text{image, text}\texttt{>}_1, \cdots, \texttt{<}\text{image, text}\texttt{>}_k]$ are retrieved from the external knowledge base $\mathcal{B}$, using the input prompt $p$ as the query and a retrieval similarity function $\gamma(p, \mathcal{B})$. We experimented with two different choices for the similarity function: maximum inner product scores for BM25~\citep{robertson2009probabilistic} and CLIP~\citep{radford2021learning}. \vspace{1ex} \\
\noindent \textbf{Model Architecture} We show the architecture of our model in~\autoref{fig:detail}, where we decompose the U-Net into the downsampling encoder (DStack) and the upsampling decoder (UStack). Specifically, the DStack takes an image, a text, and a time step as input, and generates a feature map, denoted as $f_{\theta}(\bm{x}_t, \bm{c}_p, t) \in \mathbb{R}^{F \times F \times d}$, with $F$ denoting the feature map width and $d$ denoting the hidden dimension. We share the same DStack encoder to encode the retrieved \texttt{<}image, text\texttt{>} pairs (with $t$ set to zero), which produces a set of feature maps $f_{\theta}(\bm{c}_n, 0) \in \mathbb{R}^{k \times F \times F \times d}$. We then use a multi-head attention module~\citep{vaswani2017attention} to extract the most relevant information and produce a new feature map $f'_{\theta}(\bm{x}_t, \bm{c}_p, \bm{c}_n, t) = \mathrm{Attn}(f_{\theta}(\bm{x}_t, \bm{c}_p, t), f_{\theta}(\bm{c}_n, 0))$.
The upsampling stack decoder then predicts the noise term $\bm{\epsilon}_{\theta}(\bm{x}_{t}, \bm{c}_p, \bm{c}_n, t)$ and uses it to compute $\hat{\bm{x}}_{\theta}$ with~\autoref{eq:recover_original}, which is used either for regression during training or for DDPM sampling at inference. \vspace{1ex} \\
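To illustrate the shapes involved in the attention step, the following is a minimal single-head sketch under assumptions of our own choosing: the projection matrices $\bm{W}_Q, \bm{W}_K, \bm{W}_V \in \mathbb{R}^{d \times d}$ and the flattening of spatial positions are hypothetical and are not specified in the description above. Let $\bm{H}_x \in \mathbb{R}^{F^2 \times d}$ be the flattened feature map of the noised image and $\bm{H}_n \in \mathbb{R}^{(k F^2) \times d}$ the flattened and concatenated feature maps of the $k$ retrieved neighbors. Scaled dot-product attention~\citep{vaswani2017attention} then reads
\begin{equation*}
    \mathrm{softmax}\!\left(\frac{(\bm{H}_x \bm{W}_Q)(\bm{H}_n \bm{W}_K)^{\top}}{\sqrt{d}}\right) \bm{H}_n \bm{W}_V \;\in\; \mathbb{R}^{F^2 \times d},
\end{equation*}
where the softmax is taken over the $k F^2$ neighbor positions. Each spatial position of the noised image thus receives a weighted combination of neighbor features, and the result can be reshaped back to $F \times F \times d$ before it is passed to the UStack. A multi-head variant applies the same computation in parallel over several lower-dimensional projections. \vspace{1ex} \\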
\noindent \textbf{Model Training} In order to train {Re-Imagen}\xspace, we construct a new dataset, KNN-ImageText, based on the 50M ImageText dataset used in Imagen. There are two motivations for selecting this dataset: (1) the dataset contains many similar photos of specific entities, which is extremely helpful for obtaining similar neighbors, and (2) the dataset is highly sanitized, with fewer unethical or harmful images. For each instance in the 50M ImageText dataset, we search over the same dataset with text-to-text BM25 similarity to find the top-2 neighbors as $\bm{c}_n$ (excluding the query instance). We experimented with both CLIP and BM25 similarity scores, and retrieval was implemented with ScaNN~\citep{guo2020accelerating}. We train {Re-Imagen}\xspace on KNN-ImageText by minimizing the loss function of~\autoref{eq:loss}. During training, we also randomly drop the text and neighbor conditions independently with 10\% chance. Such random dropping helps the model learn the marginalized noise terms $\bm{\epsilon}_{\theta}(\bm{x}_{t}, \bm{c}_p, t)$ and $\bm{\epsilon}_{\theta}(\bm{x}_{t}, \bm{c}_n, t)$, which are used for classifier-free guidance. \vspace{1ex} \\
\noindent \textbf{Interleaved Classifier-free Guidance}
\label{sec:interleaved}
Different from existing diffusion models, our model needs to deal with more than one condition, \textit{i.e.}, text prompts $\bm{c}_p$ and retrieved neighbors $\bm{c}_n$, which allows new options for incorporating guidance. In particular, {Re-Imagen}\xspace could use classifier-free guidance by subtracting the unconditioned $\epsilon$-predictions, or either of the two partially conditioned $\epsilon$-predictions. Empirically, we observed that subtracting unconditioned $\epsilon$-predictions (the standard classifier-free guidance of~\autoref{sec:cfg}) often leads to an undesired imbalance, where the outputs are dominated by either the text condition or the neighbor condition. Hence, we designed an interleaved guidance schedule that balances the two conditions. Formally, we define the two adjusted $\epsilon$-predictions as:
\begin{align}
\label{eq:sample}
\begin{split}
    \hat{\bm{\epsilon}}_{p} &= w_p \cdot \bm{\epsilon}_{\theta}(\bm{x}_t, \bm{c}_p, \bm{c}_n, t) - (w_p - 1) \cdot \bm{\epsilon}_{\theta}(\bm{x}_t, \bm{c}_n, t) \\
    \hat{\bm{\epsilon}}_{n} &= w_n \cdot \bm{\epsilon}_{\theta}(\bm{x}_t, \bm{c}_p, \bm{c}_n, t) - (w_n - 1) \cdot \bm{\epsilon}_{\theta}(\bm{x}_t, \bm{c}_p, t)
\end{split}
\end{align}
where $\hat{\bm{\epsilon}}_{p}$ and $\hat{\bm{\epsilon}}_{n}$ are the text-enhanced and neighbor-enhanced $\epsilon$-predictions, respectively. Here, $w_p$ is the text guidance weight and $w_n$ is the neighbor guidance weight. We then interleave the two guidance predictions according to a predefined ratio $\eta$. Specifically, at each guidance step, we sample a $[0, 1]$-uniform random number $R$; if $R < \eta$, we use $\hat{\bm{\epsilon}}_{p}$, and otherwise $\hat{\bm{\epsilon}}_{n}$. We can adjust $\eta$ to balance faithfulness to the text description against faithfulness to the retrieved image-text pairs. In the EntityDrawBench experiments, we found that $\eta=0.55$ leads to better quality.
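The interleaving rule can be summarized compactly as follows. The per-step expectation is included only as an illustration of how $\eta$ trades off the two conditions; the sampler itself applies one of the two predictions at each step, not their average:
\begin{align*}
    \hat{\bm{\epsilon}} &=
    \begin{cases}
        \hat{\bm{\epsilon}}_{p} & \text{if } R < \eta, \\
        \hat{\bm{\epsilon}}_{n} & \text{otherwise,}
    \end{cases}
    \qquad R \sim \mathcal{U}([0, 1]), \\
    \mathbb{E}_{R}[\hat{\bm{\epsilon}}] &= \eta\, \hat{\bm{\epsilon}}_{p} + (1 - \eta)\, \hat{\bm{\epsilon}}_{n} \\
    &= \big(\eta w_p + (1 - \eta) w_n\big)\, \bm{\epsilon}_{\theta}(\bm{x}_t, \bm{c}_p, \bm{c}_n, t)
       - \eta (w_p - 1)\, \bm{\epsilon}_{\theta}(\bm{x}_t, \bm{c}_n, t)
       - (1 - \eta)(w_n - 1)\, \bm{\epsilon}_{\theta}(\bm{x}_t, \bm{c}_p, t).
\end{align*}
Each step uses the text-enhanced prediction with probability $\eta$ and the neighbor-enhanced prediction with probability $1 - \eta$. For example, with $\eta = 0.55$ and a sampler that runs 256 steps, about $0.55 \times 256 \approx 141$ steps are expected to use $\hat{\bm{\epsilon}}_{p}$ and the remaining $115$ or so to use $\hat{\bm{\epsilon}}_{n}$.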
\begin{figure}[!t]
    \centering
    \includegraphics[width=1.0\linewidth]{kimgen.001.jpeg}
    \caption{The detailed architecture of our model. The retrieved neighbors are first encoded using the DStack encoder and then used to augment the intermediate representation of the denoising image (via cross-attention). The augmented representation is fed to the UStack to predict the noise.}
    \label{fig:detail}
    \vspace{-2ex}
\end{figure}


\section{Experiments}
{Re-Imagen}\xspace consists of three submodels: a 2.5B 64$\times$64 text-to-image model, a 750M 256$\times$256 super-resolution model and a 400M 1024$\times$1024 super-resolution model. We fine-tune these models on the constructed KNN-ImageText dataset. We evaluate the model under two settings: (1) automatic evaluation on the COCO and WikiImages datasets, to measure the model's general ability to generate photorealistic images, and (2) human evaluation on the newly introduced EntityDrawBench, to measure the model's capability to generate long-tail entities.

\noindent \textbf{Training and Evaluation details}
The fine-tuning was run for 200K steps on 64 TPU-v4 chips and completed within two days. We use Adafactor for the 64$\times$ model and Adam for the 256$\times$ super-resolution model with a learning rate of 1e-4. We set the number of neighbors $k$=2 and set $\gamma$=BM25 during training. For the image-text database $\mathcal{B}$, we consider three different variants: {(1)} the COCO/WikiImages training set, which contains non-overlapping small-scale in-domain image-text pairs, {(2)} the ImageText dataset containing 50M \texttt{<}image, caption\texttt{>} pairs, and {(3)} the LAION dataset~\citep{schuhmann2021laion} containing 400M \texttt{<}image, text\texttt{>} crawled pairs. Since indexing ImageText and LAION with CLIP encodings is expensive, we only considered the BM25 retriever for these databases. For the COCO/WikiImages training set, we used both BM25 and CLIP.

\subsection{Evaluation on COCO and WikiImages}
In these two experiments, we used the standard non-interleaved classifier-free guidance (\autoref{sec:cfg}) with $T$=1000 steps for both the 64$\times$ diffusion model and the 256$\times$ super-resolution model. The guidance weight $w$ for the 64$\times$ model is swept over [1.0, 1.25, 1.5, 1.75, 2.0], while the 256$\times$ super-resolution model's guidance weight $w$ is swept over [1.0, 5.0, 8.0, 10.0]. We select the guidance weight $w$ with the best FID score, which is reported in~\autoref{tab:coco}. We also show examples in~\autoref{fig:dataset}. \vspace{1ex} \\
\noindent \textbf{COCO Results}
COCO is the most widely-used benchmark for text-to-image generation models. Although COCO does not contain many rare entities, it does contain unusual combinations of common entities, so it is plausible that retrieval augmentation could also help for some challenging text prompts. We adopt FID~\citep{heusel2017gans} score to measure image quality. Following the previous literature, we randomly sample 30K prompts from the validation set as input to the model. The generated images are compared with the reference images from the full validation set (42K). We list the results in two columns: FID-30K denotes the model with access to the COCO train set (either to fine-tune or retrieve from), while Zero-shot FID-30K does not have access to any COCO data. 
\begin{table}[!t]
    \small
    \centering
    \begin{tabular}{lccc}
        \toprule
        Model & \# of Params & FID-30K & Zero-shot FID-30K \\
        \midrule
       
       
        GLIDE~\citep{nichol2021glide} & \hphantom{0}\pd5B & - & 12.24 \\
        DALL-E 2~\citep{ramesh2022hierarchical} & $\sim$5B & - & 10.39 \\
        VQ-Diffusion~\citep{gu2022vector} & 0.4B & - & 19.75 \\
        KNN-Diffusion~\citep{ashual2022knn} & 0.8B & - & 16.66 \\
       
        Stable-Diffusion~\citep{rombach2022high} & \hphantom{.}\pz1B & - & 12.63 \\
        Imagen~\citep{saharia2022photorealistic} & \hphantom{.}\pz3B  & - & \pz7.27 \\
        Make-A-Scene~\citep{gafni2022make} & \hphantom{.}\pz4B  &  7.55 & 11.84 \\
        Parti~\citep{yu2022scaling} & \pd20B  & \textbf{3.22} & \pz7.23 \\
        \midrule
        {Re-Imagen}\xspace ($\gamma$=BM25; $\mathcal{B}$=COCO; $k$=2) & 3.6B & \textbf{5.25}$^\dagger$ & - \\
        {Re-Imagen}\xspace ($\gamma$=CLIP; $\mathcal{B}$=COCO; $k$=2) & 3.6B & 5.29$^\dagger$ & - \\
        {Re-Imagen}\xspace ($\gamma$=BM25; $\mathcal{B}$=ImageText; $k$=2) & 3.6B & - & \pz7.02 \\
        {Re-Imagen}\xspace ($\gamma$=BM25; $\mathcal{B}$=LAION; $k$=2) & 3.6B & - & \hphantom{0} 6.88 \\
        \bottomrule
    \end{tabular}
    \caption{MS-COCO results for zero-shot text-to-image generation. We use a guidance weight of 1.25 for the 64$\times$ diffusion model and 5 for our 256$\times$ super-resolution model. ($\dagger$: {Re-Imagen}\xspace \textit{is not fine-tuned} on the COCO data---it only uses it as the knowledge base for retrieval.)  }
    \label{tab:coco}
    \vspace{-2ex}
\end{table}

{Re-Imagen}\xspace (with the COCO database) can achieve a significant gain on FID-30K without fine-tuning: roughly a 2.0 absolute FID improvement over Imagen. The performance is even better than fine-tuned Make-A-Scene~\citep{gafni2022make}, but slightly worse than fine-tuned 20B Parti. In contrast, {Re-Imagen}\xspace retrieving from out-of-domain databases (LAION) achieves less gain, but still obtains a 0.4 FID improvement over Imagen. {Re-Imagen}\xspace outperforms KNN-Diffusion, another retrieval-augmented diffusion model, by a large margin.
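Concretely, these gaps follow from the entries of~\autoref{tab:coco} by direct subtraction (lower FID is better):
\begin{align*}
    \text{Imagen} - \text{{Re-Imagen}\xspace} (\mathcal{B}\text{=COCO, BM25}) &: \quad 7.27 - 5.25 = 2.02, \\
    \text{Imagen} - \text{{Re-Imagen}\xspace} (\mathcal{B}\text{=LAION}) &: \quad 7.27 - 6.88 = 0.39, \\
    \text{Imagen} - \text{{Re-Imagen}\xspace} (\mathcal{B}\text{=ImageText}) &: \quad 7.27 - 7.02 = 0.25, \\
    \text{KNN-Diffusion} - \text{{Re-Imagen}\xspace} (\mathcal{B}\text{=LAION}) &: \quad 16.66 - 6.88 = 9.78 .
\end{align*}
In relative terms, the in-domain COCO database reduces Imagen's FID by about $2.02 / 7.27 \approx 28\%$, whereas the out-of-domain LAION database reduces it by about $0.39 / 7.27 \approx 5\%$.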

Since COCO does not contain infrequent entities, `entity knowledge' is not important. In contrast, retrieving from the training set can provide useful `style knowledge' for the model to ground on. Because {Re-Imagen}\xspace is able to adapt the generated images to the style of the COCO distribution, it achieves a much better FID score. As can be seen in the upper part of~\autoref{fig:dataset}, {Re-Imagen}\xspace with retrieval generates images of the same style as COCO, while without retrieval, the output is still high quality, but the style is less similar to COCO.
\begin{figure}[!t]
    \centering
    \includegraphics[width=1.0\linewidth]{dataset.001.jpeg}
    \caption{{The retrieved top-2 neighbors of COCO and WikiImages and model generation.}}
    \label{fig:dataset}
    \vspace{-2ex}
\end{figure}

\noindent\textbf{WikiImages Results}
WikiImages is constructed based on the multimodal corpus provided in Web\-QA~\citep{chang2022webqa}, which consists of \texttt{<}image, text\texttt{>} pairs crawled from Wikimedia Commons\footnote{\url{https://commons.wikimedia.org/wiki/Main_Page}}. We filtered the original corpus to remove noisy data (see Appendix~\ref{wikiimages}), which leads to a total of 320K examples. We randomly sample 22K examples as our validation set for zero-shot evaluation, and further sample 20K prompts from the dataset as the input. As in the previous experiment, we adopt the same guidance weight sweep and evaluate 256$\times$256 images. We report our experimental results in~\autoref{tab:wiki} and mainly compare with Imagen and Stable-Diffusion.

\begin{table}[!t]
    \small
    \centering
    \begin{tabular}{lccc}
        \toprule
        Model & \# of Params & FID-20K & Zero-shot FID-20K \\
        \midrule
        Stable-Diffusion~\citep{rombach2022high} & \hphantom{.}\pz1B & - & 7.50 \\
        Imagen~\citep{saharia2022photorealistic} & \hphantom{.}\pz3B & - & 6.44 \\
        \midrule
        {Re-Imagen}\xspace ($\gamma$=BM25; $\mathcal{B}$=WikiImages; $k$=2) & 3.6B & 5.88 & -  \\
        {Re-Imagen}\xspace ($\gamma$=CLIP; $\mathcal{B}$=WikiImages; $k$=2) & 3.6B & 5.85 & -  \\
        {Re-Imagen}\xspace ($\gamma$=BM25; $\mathcal{B}$=ImageText; $k$=2) & 3.6B & - & 6.04  \\
        {Re-Imagen}\xspace ($\gamma$=BM25; $\mathcal{B}$=LAION; $k$=1) & 3.6B & - & 5.94 \\
        {Re-Imagen}\xspace ($\gamma$=BM25; $\mathcal{B}$=LAION; $k$=2) & 3.6B & - & 5.82 \\
        {Re-Imagen}\xspace ($\gamma$=BM25; $\mathcal{B}$=LAION; $k$=3) & 3.6B & - & \textbf{5.80} \\
        \bottomrule
    \end{tabular}
    \caption{WikiImages results for zero-shot text-to-image generation. We use a guidance weight of 1.5 for the 64$\times$ diffusion model and 5 for our 256$\times$ super-resolution model.}
    \label{tab:wiki}
    \vspace{-2ex}
\end{table}
From~\autoref{tab:wiki}, we found that using out-of-domain LAION-400M as the database actually achieves better performance than using in-domain WikiImages as the database. Unlike COCO, Wiki\-Images contains mostly entity-focused images, so finding relevant entities in the database is more important than distilling the styles from the training set. Since LAION-400M is over 1000$\times$ larger than WikiImages-300K, the chance of retrieving related entities is much higher, which leads to better performance. One example is depicted in the lower part of~\autoref{fig:dataset}, where the LAION retrieval finds `Island of San Giorgio Maggiore', which helps the model generate the classical Renaissance-style church. When generating without retrieval, the model is not able to generate the specific church. This indicates the importance of having relevant entities in the retrievals for the WikiImages dataset and also explains why the LAION database achieves the best results. We also present more examples from WikiImages in Appendix~\ref{wikiimages}.


\subsection{Entity Focused Evaluation on EntityDrawBench}
\noindent \textbf{Dataset Construction}
We introduce EntityDrawBench to evaluate the model's capability to generate diverse sets of entities in different visual scenes. Specifically, we pick three types of visual entities (dog breeds, landmarks, and foods) from Wikimedia Commons and Google Landmarks to construct our prompts. In total, we collect 150 entity-centric prompts for evaluation. These prompts are mostly unique, and we cannot find corresponding images with Google Image Search. More construction details are in Appendix~\ref{entitydrawbench}.

We use the prompt as the input and its corresponding image-text pairs as the `retrieval' for {Re-Imagen}\xspace, to generate four 1024$\times$1024 images. For the other models, we feed the prompts directly to also generate four images. We pick the best of these four samples and rate its Photorealism and Faithfulness. For photorealism, we assign a score of 1 if the image is moderately realistic without noticeable artifacts; otherwise, we assign a score of 0. For faithfulness, we assign a score of 1 if the image is faithful to both the entity source and the text description; otherwise, we assign a score of 0.

\noindent \textbf{Experimental Results}
We use the proposed interleaved classifier-free guidance (\autoref{sec:interleaved}) for the 64$\times$ diffusion model, which runs for 256 diffusion steps under a strong guidance weight of $w$=30 for both text and neighbor conditions. For the 256$\times$ and 1024$\times$ resolution models, we use constant guidance weights of 5.0 and 3.0, respectively, with 128 and 32 diffusion steps. Inference takes 30--40 seconds for 4 images on 4 TPU-v4 chips. We present our human evaluation results for faithfulness and photorealism in~\autoref{tab:faithfulness}.

\begin{table}[!t]
    \small
    \centering
    \begin{tabular}{l|cccc|c}
        \toprule
        \multirow{2}{*}{Model} & \multicolumn{4}{c|}{\textbf{Faithfulness}} & \textbf{Photorealism} \\
        & Dogs & Foods & Landmarks & All & All \\
        \midrule \addlinespace
        Imagen  &  0.28 $\pm$ 0.02  & 0.26 $\pm$ 0.02 & 0.27 $\pm$ 0.02  & 0.27  & \textbf{0.98} \\
        DALL-E 2 & 0.60 $\pm$ 0.02 &  0.47 $\pm$ 0.02 & 0.36 $\pm$ 0.04 & 0.48 & \textbf{0.98} \\
        Stable-Diffusion & 0.16 $\pm$ 0.02 & 0.24 $\pm$ 0.04   & 0.12 $\pm$ 0.06 & 0.17 & 0.92 \\
        \midrule \addlinespace
        {Re-Imagen}\xspace & \textbf{0.68} $\pm$ 0.04 &  \textbf{0.70} $\pm$ 0.02 & \textbf{0.74} $\pm$ 0.04 & \textbf{0.71} & 0.97 \\
        \bottomrule
    \end{tabular}
    \caption{Human evaluation results for different models on different types of entities. }
    \label{tab:faithfulness}
    \vspace{-2ex}
\end{table}
{Re-Imagen}\xspace generally achieves much higher faithfulness than the existing models while maintaining similar photorealism scores. Compared with our backbone Imagen, the faithfulness score rises from 27\% to 71\%, which indicates that our model attends to the retrieved knowledge and assimilates it into the generation process.
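
As a sanity check on \autoref{tab:faithfulness}, the `All' faithfulness column is consistent with an unweighted mean over the three entity types. This reading assumes that the three types contribute equally many prompts, which is not stated explicitly above. For {Re-Imagen}\xspace and Imagen, respectively:
\[
\frac{0.68 + 0.70 + 0.74}{3} \approx 0.71,
\qquad
\frac{0.28 + 0.26 + 0.27}{3} = 0.27 .
\]
The improvement over Imagen is therefore $0.71 - 0.27 = 0.44$ in absolute terms, or roughly $0.71 / 0.27 \approx 2.6$ times the backbone's faithfulness.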

\begin{figure}[!h]
    \centering
    \begin{tabular}{@{}c@{}c@{}}
        \includegraphics[height=3cm]{figures/frequent_entity_v3} & \includegraphics[height=3cm]{figures/infrequent_entity_v3.png} \\
    \end{tabular}
    \vspace{-2ex}
    \caption{The human evaluation scores for both frequent and infrequent entities. }
    \label{fig:human_evaluation_frequency}
\end{figure}
We further partition the entities into `frequent' and `infrequent' categories based on their frequency in Imagen's training corpus (the top 50\% are `frequent'). We plot the faithfulness scores for `frequent' and `infrequent' entities separately in~\autoref{fig:human_evaluation_frequency}. Our model is less sensitive to the frequency of the input entities than the other models, with only a 10--20\% drop on infrequent entities. In contrast, both Imagen and DALL-E 2 drop by 40--50\% on infrequent entities. This study shows that retrieval augmentation makes text-to-image generation markedly more robust on long-tail entities.


\noindent \textbf{Comparison to Other Models}
We show some examples from different models in~\autoref{fig:example}. As can be seen, the images generated by {Re-Imagen}\xspace strike a good balance between text alignment and entity fidelity. Unlike image editing, which performs in-place modification, {Re-Imagen}\xspace can transform the neighbor entities both geometrically and semantically according to the text guidance. As a concrete example, {Re-Imagen}\xspace generates the \textit{Braque Saint-Germain} (2nd row in \autoref{fig:example}) on the grass, from a different viewpoint than the reference image.
\begin{figure}[!t]
    \centering
    \includegraphics[width=0.95\linewidth]{example.001.jpeg}
    \caption{Non-cherry-picked examples from EntityDrawBench for different models.}
    \vspace{-1ex}
    \label{fig:example}
\end{figure}

\begin{figure}[!t]
    \centering
    \includegraphics[width=0.95\linewidth]{more_examples_for_arxiv.001.jpeg}
    \includegraphics[width=0.95\linewidth]{more_examples_for_arxiv.002.jpeg}
    \includegraphics[width=0.95\linewidth]{more_examples_for_arxiv.003.jpeg}
    \caption{Additional non-cherry-picked examples from EntityDrawBench for different models.}
    \vspace{-1ex}
    \label{fig:more_example_for_arxiv}
\end{figure}

\noindent \textbf{Text and Entity Faithfulness Trade-offs}
In our experiments, we found a trade-off between faithfulness to the text prompt and faithfulness to the retrieved entity images. Based on~\autoref{eq:sample}, by adjusting $\eta$, \textit{i.e.}, the proportion of $\hat{\epsilon}_p$ and $\hat{\epsilon}_n$ in the sampling schedule, we can control {Re-Imagen}\xspace to generate images that explore this trade-off: decreasing $\eta$ increases faithfulness to the entity's image but decreases text alignment. In contrast, increasing $\eta$ increases text alignment and decreases similarity to the retrieved image. We demonstrate this in~\autoref{fig:transition}. With small $\eta$, the model ignores the text description and simply copies the retrieved image, while with large $\eta$, the model reverts to standard Imagen and makes little use of the retrieved image. We found that $\eta$ around 0.5 is usually a `sweet spot' that balances both conditions.
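
To make the role of $\eta$ concrete, one way to read the interleaving is sketched below. It assumes that at each sampling step $t$ exactly one of the two guided estimates is used, selected by a uniform random draw $R_t$. The symbols $R_t$ and $\hat{\epsilon}^{(t)}$ are introduced here for illustration only, and the precise rule remains the one given by~\autoref{eq:sample}.
\[
\hat{\epsilon}^{(t)} =
\left\{
\begin{array}{ll}
\hat{\epsilon}_p & \mbox{if } R_t < \eta, \\
\hat{\epsilon}_n & \mbox{otherwise,}
\end{array}
\right.
\qquad R_t \sim \mathcal{U}[0,1].
\]
Under this reading, a $T$-step sampler applies the text-guided estimate $\hat{\epsilon}_p$ in about $\eta T$ steps and the neighbor-guided estimate $\hat{\epsilon}_n$ in the remaining $(1-\eta) T$ steps. With the 256 steps of the 64$\times$ model and $\eta = 0.5$, this gives roughly 128 steps of each kind, whereas $\eta \to 0$ leaves only neighbor guidance and $\eta \to 1$ leaves only text guidance, matching the two extremes described above.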
\begin{figure}[!t]
    \centering
    \includegraphics[width=0.95\linewidth]{transition.001.jpeg}
    \caption{Ablation study of interleaved guidance ratio $\eta$ to show the trade-off. }
    \vspace{-2ex}
    \label{fig:transition}
\end{figure}


\section{Conclusions and Limitations}
We present {Re-Imagen}\xspace, a retrieval-augmented diffusion model, and demonstrate its effectiveness in generating realistic and faithful images. We show these advantages not only through automatic FID measures on standard benchmarks (\textit{i.e.}, COCO and WikiImage) but also through human evaluation on the newly introduced EntityDrawBench. We further demonstrate that our model is particularly effective at generating images from text that mentions rare entities.

{Re-Imagen}\xspace still suffers from well-known issues in text-to-image generation, which we review below in \autoref{broader_impact}. In addition, {Re-Imagen}\xspace has some unique limitations due to its retrieval-augmented modeling. First, because {Re-Imagen}\xspace is sensitive to the retrieved image-text pairs it is conditioned on, a low-quality retrieved image negatively influences the generated image. Second, {Re-Imagen}\xspace sometimes still fails to ground on the retrieved entities when the entity's visual appearance is outside the generation space. Third, we noticed that the super-resolution model is less effective and frequently misses low-level texture details of the visual entities. In future work, we plan to further investigate and address these limitations.

\section*{Ethics Statement}
\label{broader_impact}

Strong text-to-image generation models, \textit{e.g.}, Imagen~\citep{saharia2022photorealistic} and Parti~\citep{yu2022scaling}, raise ethical challenges along dimensions such as \textit{social bias}. {Re-Imagen}\xspace is exposed to the same challenges, as we employed Web-scale datasets similar to those used by the prior models.

The retrieval-augmented modeling techniques of {Re-Imagen}\xspace have substantially improved the controllability and attribution of the generated image. Like many basic research topics, this additional control could be used for beneficial or harmful purposes. One obvious danger is that {Re-Imagen}\xspace (or similar models) could be used for malicious purposes such as spreading misinformation, \textit{e.g.}, by producing realistic images of specific people in misleading visual contexts. On the other hand, additional control has many potential benefits. One general benefit is that {Re-Imagen}\xspace can reduce hallucination and increase the faithfulness of the generated image to the users' intent. Another benefit is that the ability to work with tail entities makes the model more useful for minorities and other users in smaller communities: for example, {Re-Imagen}\xspace is more effective at generating images of landmarks famous in smaller communities or cultures, and at generating images of indigenous foods and cultural artifacts. We argue that this model can help decrease the frequency-caused bias in current neural-network-based AI systems.

Considering such potential threats to the public, we have currently decided not to release the code or a public demo. In future work, we will explore a framework for responsible use that balances the value of external auditing of research with the risks of unrestricted open access, allowing this work to be used in a safe and beneficial way.

\section*{Acknowledgement}
We thank Jason Baldridge, Boqing Gong, Keran Rong and Slav Petrov for their valuable comments on an early version of the manuscript, which have helped improve this work. We also thank William Chan and Mohammad Norouzi for providing us with their support and the pre-trained models of Imagen, and Michiel de Jong for suggesting the model name.




