\section*{Acknowledgements}
The authors would like to thank Ian Gemp, Chongli Qin, and Yoram Bachrach for fruitful discussions. AJB was generously supported by an IVADO Ph.D. fellowship.
\section{Further Related Work}
As described by [] and [], the desiderata for a generative model can be broken down into three main categories:

\begin{itemize}
    \item Sample quality: Are the generated samples representative of the target distribution?
    \item Sample diversity: Do the generated samples cover the support of the target distribution?
    \item Generalization: Is the model learning to generate new samples or merely copying the training set?
\end{itemize}

The first two have been addressed to some extent by earlier scores such as IS and FID (which aggregate both properties into a single metric) and by newer evaluation methods such as Precision/Recall (which report them as two separate scores). We propose two methods of evaluating both of these properties using our KDE likelihood-based approach.

Specifically, we can think of sample quality as the likelihood of the generated samples under a density model fitted to the test samples, and of sample diversity as the likelihood of the test set under a density model fitted to the generated samples. For the former, if the samples are deemed unlikely by the model fitted to the test set, they are probably poor samples. For the latter, if certain parts of the support of the test set are not covered by the generated samples, then they will have low likelihood under this model, yielding a bad score.


There is an additional body of work which aims to test the degree to which a model overfits the training set. The most promising work, \citet{meehan2020non}, performs a Mann-Whitney test on the nearest-neighbor distances in the feature space (after PCA projection). While the score does a good job of detecting overfitting in simple toy examples, there has been limited testing of its effectiveness on real-world datasets and limited analysis of potential failure modes.

\begin{itemize}
    \item We can use a KDE to estimate the likelihood, although this is hard in high dimensions.
    \item We need to compare the test and train likelihoods.
    \item We can use embeddings, as in IS/FID.
\end{itemize}
Combining the above yields our score.
}

\section{Additional Results}
\subsection{LSUN results}
We conduct the same experiments found in \textbf{Q3} for the LSUN dataset \citep{yu2015lsun}. Specifically,
we evaluate the FLS on standard generative models and also perform a linear regression analysis with respect to FID. Fig.~\ref{fig:lsun_score_comparison} shows our findings; for example, we observe that the Improved DDPM model \citep{nichol2021improved} is underrated by FID in comparison to FLS.

\label{app:additional_results}
\begin{figure*}[h!]
\begin{center}
\centerline{\includegraphics[width=\textwidth]{figures/metric_comparison_LSUN_Bedroom.png}}
\caption{FLS vs. FID for LSUN Bedrooms dataset.}
\label{fig:lsun_score_comparison}
\end{center}
\end{figure*}


\subsection{GAN truncation experiments}
We demonstrate the usefulness of a single comprehensive score for hyperparameter selection. For example, as demonstrated by \citet{brock2018large}, truncating the latent variable from which a GAN samples can increase sample quality at the expense of sample diversity. With FLS, the optimal truncation value for a GAN can be picked in a way that trades off naturally between precision and recall. Fig.~\ref{fig:gan_truncation_experiment} shows the FID and FLS values for various truncation values of StyleGANXL on Imagenet 128x128 (unconditional). Interestingly, the optimal truncation value selected by FLS is lower than the one selected by FID in both cases, indicating once again that FLS seems to value image quality more highly than FID.
\begin{figure*}[h!]
\begin{center}
\centerline{\includegraphics[width=\textwidth]{figures/truncations.png}}
\caption{FLS vs. FID for selecting optimal GAN truncation values.}
\label{fig:gan_truncation_experiment}
\end{center}
\end{figure*}

\subsection{Overfitting and dataset size}
We investigate the impact of dataset size on overfitting. As generative models tend to struggle with small datasets, we start with a StyleGAN2-Ada model pre-trained on FFHQ and fine-tune it on AFHQ (cat), as described by \citet{karras2020training}. Specifically, we split the training set of AFHQ (cat) into subsets of varying sizes (500, 1000, 2000, 4000) and fine-tune the pre-trained StyleGAN2-Ada model on each subset, using default parameters. We then take the final checkpoint after 72 hours of training and compute FLS, CTScore and AuthPct for each dataset size (using the subset, and not the whole dataset, as the training set).
\begin{figure*}[h!]
\begin{center}
\centerline{\includegraphics[width=\textwidth]{figures/overfit_comparison_AFHQ.png}}
\caption{Overfitting metric values for StyleGAN2-Ada trained on various sizes of AFHQ cat. For \% overfit Gaussians (-50\%), a positive value indicates overfitting while for the other two metrics, a positive value indicates underfitting.}
\end{center}
\end{figure*}

Interestingly, here FLS indicates there is some degree of overfitting whereas the two other metrics signal that the models are underfit. The discrepancy can potentially be explained by the same reason provided for the transforms experiment. Due to the relatively small sizes of the train and test sets, it is possible that the generated samples are far from both, which leads CTScore and AuthPct to consider them ``underfit''. We also investigate qualitatively below, using the overfit Gaussian methodology used previously, and find evidence of overfit samples that combine the color of some training samples and the pose of others.

\begin{figure*}[h!]
\begin{center}
\centerline{\includegraphics[width=\textwidth]{figures/overfit_samples_AFHQ.png}}
\caption{Overfit samples on AFHQ. Each row corresponds to the most overfit sample for some model, the leftmost image being the generated sample. The 8 images on the right are the samples in the train set with the highest likelihood for the Gaussian corresponding to that sample. From top to bottom: 500 training samples, 1000 training samples, 2000 training samples, 4000 training samples.}
\end{center}
\end{figure*}

\clearpage
\section{Additional overfit samples}
Below are additional overfit samples detected from various models/datasets.

\begin{figure*}[h!]
\begin{center}
\centerline{\includegraphics[width=\textwidth]{figures/overfit_samples_CIFAR10.png}}
\caption{Overfit samples for CIFAR10. Each row corresponds to the most overfit sample for some model, the leftmost image being the generated sample. The 8 images on the right are the samples in the train set with the highest likelihood for the Gaussian corresponding to that sample. From top to bottom: DDPM, NVAE, DiffStyleGAN2, ImprovedDDPM, StyleGANXL, BigGAN-CR, ReACGAN}
\end{center}
\end{figure*}

\begin{figure*}[h!]
\begin{center}
\centerline{\includegraphics[width=\textwidth]{figures/overfit_samples_ImageNet.png}}
\caption{Overfit samples on ImageNet. Each row corresponds to the most overfit sample for some model, the leftmost image being the generated sample. The 8 images on the right are the samples in the train set with the highest likelihood for the Gaussian corresponding to that sample. From top to bottom: ADM, StyleGANXL, ContraGAN, SNGAN, StyleGAN2, StyleGAN3}
\end{center}
\end{figure*}

\begin{figure*}[h!]
\begin{center}
\centerline{\includegraphics[width=\textwidth]{figures/overfit_samples_LSUNBedroom256.png}}
\caption{Overfit samples on LSUN Bedroom. Each row corresponds to the most overfit sample for some model, the leftmost image being the generated sample. The 8 images on the right are the samples in the train set with the highest likelihood for the Gaussian corresponding to that sample. From top to bottom: DiffProjGAN, DiffStyleGAN2, ADM, ImprovedDDPM, LatentDiff, ProjGAN}
\end{center}
\end{figure*}

\clearpage

\section{Optimization procedure}
Selecting $\sigma_j$ for each generated sample by solving the optimization problem in Eq.~\ref{eq:cross_val} is non-trivial. Even with the simplifying assumption of a diagonal covariance matrix of the form $\sigma_j^2 I$, we still need to learn one parameter per Gaussian (of which there are 10000). Because the log-likelihood of each point is the $\log$ of a sum over the individual Gaussians, the $\sigma_j$ cannot be optimized independently. Thus, we turn to full-batch gradient descent using the Adam optimizer with the following hyperparameters:

\begin{itemize}
    \item 100 steps
    \item $lr=0.5$
    \item Initial value for log variance: $0$
    \item We use a 2-step lr scheduler, reducing the learning rate by a factor of 10 after 50 steps.
\end{itemize}

In addition, getting the likelihood assigned by each Gaussian to each point requires computing the distance between each pair $(\mathbf{x}^{\text{gen}}_j, \mathbf{x}^{\text{train}}_i)$, which is $O(n^2)$ and time-consuming. As the distances and dimensions are large (at least for our high-dimensional experiments), the exponentiation and summations were often numerically unstable. To address this, we:
\begin{itemize}
    \item Compute the $O(n^2)$ distance matrix once and store it so as not to recompute it for each step of the optimization procedure.
    \item Optimize the log variances instead of the variances themselves
    \item Convert $\sigma_j^{-d}$ to $\exp(-d \log \sigma_j)$ to be able to take advantage of a numerically stable logsumexp.
\end{itemize}
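
The last two points can be sketched as follows, assuming an equally weighted mixture of $m$ isotropic Gaussians in $d$ dimensions; the symbols $s_j$, $D_{ij}$ and $a_{ij}$ are introduced here purely for illustration. Writing $s_j = \log \sigma_j^2$ for the log variance and $D_{ij} = ||\varphi(\mathbf{x}^{\text{gen}}_j)-\varphi(\mathbf{x}^{\text{train}}_i)||^2$ for the precomputed squared distance, the log-likelihood of a training point becomes
\begin{align*}
\log p(\mathbf{x}^{\text{train}}_i)
  &= \log \frac{1}{m} \sum_{j=1}^{m} (2\pi)^{-d/2}\, \sigma_j^{-d} \exp\Big(-\tfrac{D_{ij}}{2\sigma_j^2}\Big) \\
  &= \log \sum_{j=1}^{m} \exp\big(a_{ij}\big) - \log m - \tfrac{d}{2}\log(2\pi),
  \qquad a_{ij} = -\tfrac{d}{2}\, s_j - \tfrac{1}{2}\, D_{ij}\, e^{-s_j}.
\end{align*}
The log of the sum can then be evaluated stably by factoring out $a_i^{\star} = \max_j a_{ij}$:
\begin{align*}
\log \sum_{j=1}^{m} \exp\big(a_{ij}\big) = a_i^{\star} + \log \sum_{j=1}^{m} \exp\big(a_{ij} - a_i^{\star}\big),
\end{align*}
so that every exponent is non-positive and the largest term in the sum is exactly $1$.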

From plotting the loss, the variances almost always converged in short order (usually in fewer than 20 steps). While we have no guarantee that these were global minima, when there were exact or near-exact copies, Algorithm~\ref{alg:FLS} would recover very low log variances, as shown in Proposition~\ref{prop1}.

\clearpage
\section{Experimental Setup}
\subsection{Detecting overfitting in low dimensions}
For the overfit KDE experiment, we generate $3000$ points from the Two Moons dataset using scikit-learn~\citep{pedregosa2011scikit} with a noise value of $0.1$. The first $2000$ points are used as the training set and the last $1000$ as the test set. We fit a KDE to the train set using bandwidth values varying from $10^{-4}$ to $10$ and sample $1000$ points as generated samples before computing our score.

As for the GAN, we use a simple $2$-layer fully connected generator and discriminator with ReLU activations. We generate $1000$ samples (fewer than above, to make the difference between the train and test sets easier to see) with the same $0.1$ noise value. The first $500$ are used as the train set and the last $500$ as the test set. We train for $10000$ steps using full-batch gradient descent (learning rate of $10^{-2}$) on the training set with a fixed value of the latent variable $z$ for the generated samples (to encourage it to overfit). Finally, the samples produced from that fixed $z$ serve as the generated samples.

\subsection{Detecting overfitting in high dimensions}
\xhdr{Data copying} We take $3$ separate batches of $10000$ samples from the CIFAR10~\citep{krizhevsky2014cifar} train set and the entire test set. The first is used as the ``training set'' for the purposes of our score computation, the second as the validation set and the last as the baseline set. The ``generated samples'' are a mixture of the validation set and increasingly more samples from the ``training set'' (with Gaussian noise of $0.1$ added in the feature space). Finally, we get score values with respect to the test set.

\xhdr{Simple transforms}
We use standard torchvision~\citep{torchvision2016} image transforms applied to the ``training set'' above and repeat the same process:
\begin{itemize}
    \item \textbf{Horizontal flip}: The image is flipped horizontally.
    \item \textbf{Gaussian blur}: Gaussian blur with a $(5,5)$ kernel and a $\sigma$ of $0.5$.
    \item \textbf{Color jitter}: Color jitter transform with brightness in $[0.6, 1.4]$, contrast in $[0.6,1.4]$, saturation in $[0.6, 1.4]$ and hue in $[-0.05, 0.05]$.
    \item \textbf{Center crop 30}: Center crop the image to $(30,30)$ and fill the rest with black.
    \item \textbf{Center crop 24}: Center crop the image to $(24,24)$ and fill the rest with black.
    \item \textbf{Random rotation}: Random rotation between $[0,45]$ degrees.
\end{itemize}

\subsection{Large scale model comparison}
For model comparison, we use either pre-trained networks provided by the respective paper authors or pre-trained networks provided by StudioGAN~\citep{kang2022StudioGAN}, a very impressive library that reproduces a large array of GAN models.

We generate 10000 samples for each. Due to the computational requirements of generating samples from diffusion models (often $\sim$2 orders of magnitude higher than for GANs), we resort to lowering the number of steps during sampling with aggressive timestep respacing. It is likely that with more steps, diffusion models would achieve even higher FLS values.

\xhdr{CIFAR10 models}
\begin{itemize}
    \item \textbf{StudioGAN} ~\citep{kang2022StudioGAN}: We take the pre-trained models provided \href{https://huggingface.co/Mingguksky/PyTorch-StudioGAN/tree/main/studiogan_official_ckpt/CIFAR10_tailored}{here}. Specifically, if there are multiple training runs, we take the latest one and use the weights from the best checkpoint. The models that were reproduced by StudioGAN that we used are: DCGAN ~\citep{radford2015unsupervised}, WGAN-GP ~\citep{gulrajani2017improved}, SNGAN ~\citep{miyato2018spectral}, SAGAN ~\citep{zhang2019self}, ReACGAN ~\citep{kang2021rebooting}, ProjGAN ~\citep{sauer2021projected}, LSGAN ~\citep{mao2017least}, LOGAN ~\citep{wu2019logan}, BigGAN-CR ~\citep{zhang2019consistency} and ACGAN-Mod ~\citep{kang2021rebooting}.
    \item \textbf{StyleGANXL} ~\citep{sauer2022stylegan}: We use the pre-trained model provided \href{https://s3.eu-central-1.amazonaws.com/avg-projects/stylegan_xl/models/cifar10.pkl}{here} and generate 10000 samples (seeds $0-10000$).
    \item \textbf{StyleGAN2-ada} ~\citep{karras2020training}: We use the pre-trained model provided \href{https://nvlabs-fi-cdn.nvidia.com/stylegan2-ada/pretrained/cifar10.pkl}{here} with default parameters (seeds $0-10000$).
    \item \textbf{DiffProjGAN and DiffStyleGAN2} ~\citep{wang2022diffusion}: We use the pre-trained model provided \href{https://huggingface.co/zhendongw/diffusion-gan/resolve/main/checkpoints/diffusion-projectedgan-cifar10.pkl}{here} and  \href{https://huggingface.co/zhendongw/diffusion-gan/resolve/main/checkpoints/diffusion-stylegan2-cifar10.pkl}{here} with default parameters (seeds $0-10000$).
    \item \textbf{DDPM} ~\citep{ho2020denoising}: We use the diffusers library ~\citep{von-platen-etal-2022-diffusers} (specifically the default DDIM pipeline) with the model provided \href{https://huggingface.co/google/ddpm-cifar10-32}{here}.
    \item \textbf{ImprovedDDPM} ~\citep{nichol2021improved}: We use the model provided \href{https://openaipublic.blob.core.windows.net/diffusion/march-2021/cifar10_uncond_50M_500K.pt}{here} with $1000$ diffusion steps, a timestep respacing of 100 and DDIM ~\citep{song2020denoising} (otherwise we use default parameters).
    \item \textbf{NVAE} ~\citep{vahdat2020nvae}: We use the checkpoint provided \href{https://drive.google.com/drive/folders/1M1CDwVoV0Ltj0E20yZ8UDY3aSUX284z6}{here} with default parameters.
\end{itemize}


\xhdr{ImageNet models}
\begin{itemize}
    \item \textbf{StudioGAN} ~\citep{kang2022StudioGAN}: We take the pre-trained models provided \href{https://huggingface.co/Mingguksky/PyTorch-StudioGAN/tree/main/studiogan_official_ckpt/CIFAR10_tailored}{here}. Specifically, if there are multiple training runs, we take the latest one and use the weights from the best checkpoint. The models that were reproduced by StudioGAN that we used are: ReACGAN ~\citep{kang2021rebooting}, BigGAN ~\citep{brock2018large}, SNGAN ~\citep{miyato2018spectral}, ContraGAN ~\citep{kang2020contragan}, StyleGAN2 ~\citep{karras2020analyzing}, StyleGAN3 ~\citep{karras2021alias} and SAGAN ~\citep{zhang2019self}.
    \item \textbf{StyleGANXL} ~\citep{sauer2022stylegan}: We use the pre-trained model provided \href{https://s3.eu-central-1.amazonaws.com/avg-projects/stylegan_xl/models/imagenet128.pkl}{here} and generate 10000 samples with default parameters (seeds $0-10000$).
    \item \textbf{ADM} ~\citep{dhariwal2021diffusion}: We use the checkpoint provided \href{https://openaipublic.blob.core.windows.net/diffusion/jul-2021/128x128_diffusion.pt}{here} with $1000$ diffusion steps, a timestep respacing of 100 and DDIM (otherwise we use default parameters).
\end{itemize}

\xhdr{LSUN models}
\begin{itemize}
    \item \textbf{ProjGAN} ~\citep{sauer2021projected}: We use the pre-trained models provided \href{https://s3.eu-central-1.amazonaws.com/avg-projects/projected_gan/models/bedroom.pkl}{here} with default parameters (seeds $0-10000$).
    \item \textbf{DiffProjGAN and DiffStyleGAN2} ~\citep{wang2022diffusion}: We use the checkpoints provided \href{https://huggingface.co/zhendongw/diffusion-gan/resolve/main/checkpoints/diffusion-projectedgan-lsun-bedroom.pkl}{here} and \href{https://huggingface.co/zhendongw/diffusion-gan/resolve/main/checkpoints/diffusion-stylegan2-lsun-bedroom.pkl}{here} with default parameters (seeds $0-10000$).
    \item \textbf{ADM} ~\citep{dhariwal2021diffusion}: We use the checkpoint provided \href{https://openaipublic.blob.core.windows.net/diffusion/jul-2021/lsun_bedroom.pt}{here} with $1000$ diffusion steps, a timestep respacing of 100 and DDIM (otherwise we use default parameters).
    \item \textbf{ImprovedDDPM} ~\citep{nichol2021improved}: We use the checkpoint provided \href{https://openaipublic.blob.core.windows.net/diffusion/jul-2021/128x128_diffusion.pt}{here} with $1000$ diffusion steps, a timestep respacing of 100 and DDIM (otherwise we use default parameters).
    \item \textbf{Latent Diffusion} ~\citep{rombach2022high}: We use the checkpoint provided \href{https://ommer-lab.com/files/latent-diffusion/lsun_bedrooms.zip}{here} with $200$ diffusion steps, DDIM and an $\eta$ of $1$.
\end{itemize}

\clearpage
\section{Proof of Proposition~\ref{prop1}}
We now prove Proposition~\ref{prop1} from the main paper.
\label{app:proof}
\propx*
\begin{proof}
Let us consider $\mathbf{x}^{\text{gen}}_k \in \{\mathbf{x}^{\text{gen}}_j\}_{j=1}^m \cap \{\mathbf{x}_{i}^{\text{train}}\}_{i=1}^n$. Then there exists $l$ such that $\mathbf{x}^{\text{gen}}_k=\mathbf{x}^{\text{train}}_l$, and thus
\begin{align*}
   &\sum_{i=1}^{n} \log \sum_{j=1}^{m} 
  \frac{\exp \Big({\tfrac{-||\varphi(\mathbf{x}^{\text{gen}}_j)-\varphi(\mathbf{x}^{\text{train}}_i)||^2}{2 \sigma_j^2}}\Big)}{\sigma_j^{d}}\\
 &\geq  \frac{\exp \Big({\tfrac{-||\varphi(\mathbf{x}^{\text{gen}}_k)-\varphi(\mathbf{x}^{\text{train}}_l)||^2}{2 \sigma_k^2}}\Big)}{\sigma_k^{d}} \\
 &= 
 \sigma_k^{-d}  \underset{\sigma_k \to 0}{\longrightarrow} \infty
\end{align*}
\vspace{-25pt}
\end{proof}

\cut{

\subsection{Truncation experiments}

\clearpage

\section{Feature spaces}
}



\section{Conclusion}
\looseness=-1
In this paper, we introduce \textbf{FLS}, a new holistic evaluation metric for deep generative models that encompasses many of the key desiderata for a sound evaluation. Specifically, FLS is easy to compute, broadly applicable to all generative models, and evaluates generation quality, diversity, as well as generalization. Moreover, we show that, unlike previous approaches, FLS provides more insights into trained generative models while being universally applicable to all model families. We empirically demonstrate both on synthetic and real-world datasets that FLS can diagnose important failure modes such as memorization/overfitting---informing practitioners on the potential limitations of generative models that generate photo-realistic images. While we focused on the domain of natural images, a fertile direction for future work is to extend FLS to other data modalities such as text, audio, or time series, and to the evaluation of conditional generative models.





\section{Experiments}
\label{sec:experiments}
\looseness=-1
We investigate the application of FLS on generative models that span a broad category of model families, including popular GAN, Diffusion, and VAE-based generative models. For datasets, we train our models on toy datasets such as Two Moons \citep{pedregosa2011scikit} as well as on popular natural image benchmarks such as CIFAR10 \citep{krizhevsky2014cifar}, Imagenet \citep{deng2009imagenet}, LSUN \citep{yu2015lsun}, and AFHQ \citep{choi2020stargan}.

\looseness=-1
\xhdr{Baselines}
Throughout our experiments, we rely on three representative baseline metrics to evaluate generative models: FID, the $C_T$ score \citep{meehan2020non}, and AuthPct. These baselines all have the benefit of being sample-based evaluation metrics and allow for a fair comparison with FLS. The $C_T$ score is a Mann-Whitney test on the distribution of distances between generated and train samples compared to the distribution of distances between train samples and test samples (negative implies overfit, positive implies underfit). The AuthPct $\in [0,100]$, derived from the authenticity measure of \citet{alaa2022faithful}, is simply the percentage of generated samples deemed authentic by their metric (i.e., whose distance to their nearest neighbor in the train set is larger than the distance between that nearest neighbor and its nearest neighbor in the train set).
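
The AuthPct rule described above can be sketched in symbols as follows. This is an illustrative formalization only: the distance $\operatorname{dist}(\cdot,\cdot)$ in feature space and the nearest-neighbor index $\operatorname{nn}(j)$ are notation introduced here, and the details of the original authenticity measure may differ. For $m$ generated samples, let $\operatorname{nn}(j) = \arg\min_{i} \operatorname{dist}(\mathbf{x}^{\text{gen}}_j, \mathbf{x}^{\text{train}}_i)$ be the index of the training sample closest to $\mathbf{x}^{\text{gen}}_j$. Then
\begin{align*}
\text{AuthPct} = \frac{100}{m} \sum_{j=1}^{m} \mathbf{1}\Big[\operatorname{dist}\big(\mathbf{x}^{\text{gen}}_j, \mathbf{x}^{\text{train}}_{\operatorname{nn}(j)}\big) > \min_{i \neq \operatorname{nn}(j)} \operatorname{dist}\big(\mathbf{x}^{\text{train}}_{\operatorname{nn}(j)}, \mathbf{x}^{\text{train}}_i\big)\Big],
\end{align*}
where $\mathbf{1}[\cdot]$ equals $1$ when its argument holds and $0$ otherwise.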

\looseness=-1
Our experiments seek to answer the following questions:
\begin{enumerate}[label={(\bf Q\arabic*)}, topsep=0pt, parsep=0pt, leftmargin=25pt, itemsep=2pt]
    \item \textbf{Detecting overfitting in low dimensions}. Does FLS detect overfitting in low-dimensional settings?
    \item \textbf{Detecting overfitting in high dimensions}. Does FLS detect overfitting in high-dimensional settings?
    \item \textbf{Difference with existing metrics}. How does FLS compare to existing evaluation metrics?
\end{enumerate}



\subsection{Detecting overfitting in low dimensions (Q1)}

\looseness=-1
\xhdr{Overfitting on Two Moons}
We consider two generative models trained on the Two Moons dataset. The first model is a KDE with the kernel means set to the training set points (visualized in Fig.~\ref{fig:density_fit}), but the bandwidth $\sigma$ varies from low to high. For our second model, we train a simple GAN with minor modifications, e.g., large batch sizes, to encourage it to overfit. In Fig.~\ref{fig:basic_experiments}, we plot our FLS and all baselines throughout training. As expected, the FLS is near zero for low values of $\sigma$, indicating a high degree of overfitting. However, as $\sigma$ increases, so does FLS as the samples become less and less overfit before finally starting to drop as the problem shifts from overfitting to underfitting. As such, FLS matches the ability of the $C_T$ score and AuthPct to discern overfitting when $\sigma$ is low while also providing an evaluation of sample quality as $\sigma$ increases. Similarly, for the GAN model, we find that FLS provides a good measure of the improvement in sample quality as the GAN learns the distribution. Eventually, as the GAN begins to overfit, FLS decreases (even though sample quality and diversity remain high).



\begin{figure}[h!]
\vspace{-10pt}
\centerline{\includegraphics[width=1.1\linewidth, height=1.2in]{figures/basic_experiments.png}}
\vspace{-10pt}
\caption{Comparison of metrics on synthetic examples.}
\label{fig:basic_experiments}
\vspace{-15pt}
\end{figure}

\subsection{Detecting overfitting in high dimensions (Q2)}
\looseness=-1
\xhdr{Data Copying on CIFAR10}
We now study the effect of exact-data copying on CIFAR10, a high-dimensional setting that necessitates mapping inputs to a feature space. We construct a synthetic-generated set with ``perfect'' quality and diversity but with some memorized training examples. We do so by creating a mixture of the validation set and the training set. We add a small amount of noise to the features of the copied training examples, as perfectly exact copies are unlikely. Intuitively, increasing the proportion of training examples in our synthetic-generated set corresponds to more overfitting/memorization. We report our results in Fig.~\ref{fig:basic_experiments} and observe that FLS decreases as the number of copied/memorized examples increases. The less drastic decrease relative to other metrics is due to the impact of sample quality. Indeed, even with a high percentage of copied samples, the generated set still contains a large number of high-quality, diverse samples.
\begin{figure}[ht]
\centerline{\includegraphics[width=\columnwidth]{figures/indexed_visualized_transformations.png}}
\vspace{-15pt}
\caption{Visualized examples of the transformations.}
\label{fig:image_transformations}
\vspace{-5pt}
\end{figure}

\begin{figure*}[ht]
\centerline{\includegraphics[width=\textwidth]{figures/overfit_comparison.png}}
\vspace{-10pt}
\caption{Overfitting score comparison for CIFAR10 and Imagenet generative models. \textbf{Left:} Computing ${\mathcal{O}}$ after standardization by subtracting $50\%$. \textbf{Mid and Right:} $C_T$ score and AuthPct (also $-50\%$) on various deep generative models.}
\label{fig:cifar10_and_imagenet_overfit_comparison}
\vspace{-5pt}
\end{figure*}
\begin{figure}[H]
\centerline{\includegraphics[width=\columnwidth]{figures/FLS_fig_5.pdf}}
\caption{When generated samples are far from both the train and test set (relative to the distance between the train and test sets) but closer to the train set, overfitting is identified by FLS but not by the baselines.}
\vspace{-5pt}
\label{fig:score_discrepancy_explanation}
\end{figure}

\looseness=-1
\xhdr{Transformed Data Copying on CIFAR10}
We now consider a slightly more realistic scenario than nearly exact data copying by applying an image transformation to each copied example. We experiment with popular data augmentation techniques---e.g., cropping, blurring, color jitter, and rotations---as our main image transformation methods (see Fig.~\ref{fig:image_transformations}). Tab.~\ref{tab:image_transformation_results} summarizes our findings.
For all transformations, FLS correctly decreases as there are more copied training samples in the pseudo-generated set. However, we find that the baselines fail to detect overfitting for the center crop 24 and random rotation transformations. This failure to capture overfitting can, in part, be explained by both of these scores examining whether generated samples are too close to the training set and not whether they are closer to the training set than they are to the test set (see Fig.~\ref{fig:score_discrepancy_explanation}).



 










\begin{table}[t]
\vspace{-2mm}
\caption{ \small
Detection of overfitting by various scores for a model that produces ${\mathcal{D}}_{\text{transformed}}$, a set of transformed copies of the training set. $\Delta \texttt{score} = \texttt{score}({\mathcal{D}}_{\text{baseline}})-\texttt{score}({\mathcal{D}}_{\text{transformed}})$. 
}
\label{tab:image_transformation_results}
\begin{center}
\begin{sc}
\scriptsize
\begin{tabular}{lccc}
\toprule
Transformation & $\Delta$FLS $\uparrow$& $\Delta$CTScore $\uparrow$ & $\Delta$AuthPct $\uparrow$  \\
\midrule
Horizontal Flip (a) &  66.8 \:\: {\color{OliveGreen}\cmark} &  71.5 \:\: {\color{OliveGreen}\cmark} &   72.8 \:\: {\color{OliveGreen}\cmark} \\
Gaussian Blur (b)  &  65.3 \:\: {\color{OliveGreen}\cmark} &  66.9 \:\: {\color{OliveGreen}\cmark} &  72.6 \:\: {\color{OliveGreen}\cmark} \\
Color Jitter (c)    &  25.4 \:\: {\color{OliveGreen}\cmark} &  48.5 \:\: {\color{OliveGreen}\cmark} & 65.0 \:\: {\color{OliveGreen}\cmark} \\
Center Crop 30 (d)  &  19.3 \:\: {\color{OliveGreen}\cmark} &  20.1 \:\: {\color{OliveGreen}\cmark} & 46.7  \:\: {\color{OliveGreen}\cmark} \\
Center Crop 24 (e)  &  21.0 \:\: {\color{OliveGreen}\cmark} &  -21.5 \:\: {\color{red}\xmark} &  -0.5 \:\: {\color{red}\xmark} \\
Random Rotation (f) &  17.3 \:\: {\color{OliveGreen}\cmark} &  -16.6 \:\: {\color{red}\xmark} &  -6.6 \:\: {\color{red}\xmark} \\
\bottomrule
\end{tabular}
\end{sc}
\end{center}
\vspace{-4pt}
\end{table}










\xhdr{Detecting overfitting}
We now investigate the extent to which deep generative models overfit on natural image datasets. We start by attempting to disentangle the impact of overfitting on FLS by examining each fitted Gaussian and computing the difference between the train and test log-likelihoods under each Gaussian,
\begin{equation}
{\mathcal{O}}_i = \log {\mathcal{N}}({\mathcal{D}}_{\text{train}}|x_i; \hat{\sigma}_i^2I) - \log {\mathcal{N}}({\mathcal{D}}_{\text{test}}|x_i; \hat{\sigma}_i^2I).
\end{equation}
\looseness=-1
If ${\mathcal{O}}_i$ is positive, we deem the Gaussian overfit. Calculating the percentage of overfit Gaussians gives us a proxy measure of the impact of overfitting on the final FLS value. We expect this percentage to be $\approx 50$\% when there is no overfitting and to rise linearly with the number of overfit samples. Quantitatively, we perform a large-scale evaluation of overfitting in Fig.~\ref{fig:cifar10_and_imagenet_overfit_comparison}, evaluating ${\mathcal{O}}_i$, $C_T$ score and AuthPct on a variety of models and datasets. Interestingly, StyleGANXL~\citep{sauer2022stylegan} is deemed to overfit on both datasets, an observation that the $C_T$ score and AuthPct corroborate. In addition, using FLS, we find varying degrees of overfitting behavior for other standard generative models, in a similar fashion to the $C_T$ score and AuthPct.
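
In symbols, and assuming $m$ fitted Gaussians (one per generated sample), the proxy described above can be sketched as follows; the name $\text{PctOverfit}$ is hypothetical and used only for illustration:
\begin{align*}
\text{PctOverfit} = \frac{100}{m} \sum_{i=1}^{m} \mathbf{1}\big[{\mathcal{O}}_i > 0\big],
\end{align*}
where $\mathbf{1}[\cdot]$ equals $1$ when its argument holds and $0$ otherwise. Under this reading, the standardized value $\text{PctOverfit} - 50$ is positive exactly when more than half of the Gaussians assign a higher log-likelihood to the train set than to the test set.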

We also qualitatively evaluate these findings by comparing the discrepancy between training and test likelihoods under each Gaussian---i.e., ${\mathcal{O}}_i$---in the generated samples. By ranking them, we obtain an ordering of the generated samples from the highest to the lowest contribution to overfitting. Using this as a tool, we visualize the overfitting behavior of standard deep generative models on CIFAR10 and Imagenet in Fig.~\ref{fig:cifar10_and_imagenet_overfit_comparison}. Interestingly, we find different candidates for ``overfit'' samples than simply looking at nearest-neighbors (as depicted in Fig.~\ref{fig:overfit_samples_using_FLS}). It is likely that an overfit sample will resemble several different training samples, borrowing a bit from each instead of just being close to one of them, a subtlety that is captured by FLS and not by nearest-neighbor-based evaluations.



\begin{figure*}[h!]
\begin{center}
\centerline{\includegraphics[width=\textwidth, height=2.9in]{figures/updated_overfit_vs_nn.png}}
\vspace{-10pt}
\caption{Overfit samples detected using FLS vs Nearest Neighbors, illustrating the more subtle overfitting captured by FLS. (1) is the sample corresponding to the most overfit Gaussian while (2) consists of the train samples with the highest likelihood under (1). We compare with (3), the generated sample closest to the training set in $\ell_2$ distance, while (4) shows the $3$ closest training samples. (a) is from BigGAN-CR on CIFAR10, (b) is from DiffStyleGAN2, (c) is from StyleGANXL on Imagenet, and (d) is from ContraGAN.}
\label{fig:overfit_samples_using_FLS}
\end{center}
\vspace{-5pt}
\end{figure*}

\begin{figure*}[h!]
\vspace{-5pt}
\begin{center}
\centerline{\includegraphics[width=\textwidth, height=2.9in]{figures/metric_comparison_CIFAR10.png}}
\vspace{-10pt}
\caption{FLS vs FID score comparisons for CIFAR10 and Imagenet.}
\label{fig:imagenet_score_comparison}
\end{center}
\vspace{-15pt}
\end{figure*}

\cut{
\begin{figure*}[h!]
\includegraphics[width=0.24\linewidth]{figures/stylegan2_training_800.pdf}
\includegraphics[width=0.24\linewidth]{figures/stylegan2_training_1600.pdf}
\includegraphics[width=0.24\linewidth]{figures/stylegan2_training_6400.pdf}
\includegraphics[width=0.24\linewidth]{figures/stylegan2_training_12800.pdf}
\vspace{-10pt}
\caption{StyleGAN 2 evaluated using FLS, FID, and $\%$ Overfit Gaussians versus training dataset size over the course of training. From left to right we use batch sizes of $800, 1600, 6400, 12800$ respectively.}
\label{fig:stylegan2_training_dataset_size}

\end{figure*}
}

\cut{
\xhdr{Detecting overfit samples}
Using our FLS score, it is possible to qualitatively inspect suspected overfitting samples in the following manner: since each Gaussian in the MoG used in FLS is initialized using the generated samples, it is possible to compute the likelihoods of train and tests under each Gaussian independently---i.e., we treat one Gaussian in the MoG as the generative model and use that to compute the train and test likelihood. By comparing the discrepancy between training and test likelihoods under each Gaussian in the generated samples and then ranking them, we achieve an ordering over the generated samples that have the highest to lowest contribution to overfitting. This notion of overfitting samples is more nuanced than a simple nearest neighbor test, capturing samples that are too close to multiple training images rather than just one (as depicted in \ref{fig:overfit_samples_using_FLS}). It is likely that an overfit sample will resemble a bunch of different training samples, borrowing a bit from each instead of just being close to one of them.




This method of detecting overfit samples provides a more nuanced analysis than typical nearest-neighbor comparisons (even if they are done in a suitable feature space). Even if there might be samples that are even closer to their training set equivalent, our method looks at the ensemble of samples. It is quite likely that an overfit sample will resemble a bunch of different training samples, borrowing a bit from each instead of just being close to one of them.
}

\looseness=-1
\subsection{Difference and Correlation with FID (Q3)}
\label{sec:exp_q3}
We now investigate the relationship between FLS and FID, performing a large-scale comparison of the same set of models and datasets. For CIFAR10 and Imagenet, we plot FLS vs.\ FID in Fig.~\ref{fig:imagenet_score_comparison} and conduct a simple linear regression analysis. A corresponding plot for LSUN is presented in Appendix~\S\ref{app:additional_results}. Overall, we find evidence of a strong correlation between FLS and FID for all datasets, which highlights that FLS captures sample quality and diversity as well. However, despite this correlation, inspecting the residuals of a standard linear regression between FID and FLS reveals a few clear outliers. In particular, we find that certain models obtain a better FID than their FLS would suggest (i.e., they are ``overrated'' by FID).

\looseness=-1
For example, we find that NVAE \citep{vahdat2020nvae} is overrated by FID. We reconcile this fact by noting that NVAE samples tend to be more diverse but of lower quality. The discrepancy in scores indicates that, from a likelihood perspective, FLS values sample quality more than FID does. We also uncover models that FID underrates. For instance, on CIFAR10, we find that DCGAN \citep{radford2015unsupervised} is underrated, which we hypothesize is due to the simplicity of the model and to the fact that it predates the FID metric, so its design could not have been tuned to optimize FID.
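
As a sketch of the residual analysis described above (the regression direction and the symbols $a$, $b$, $e_k$, and $K$ are illustrative assumptions, not a specification of the exact procedure), one can fit a least-squares line across the evaluated models $k = 1, \ldots, K$ and inspect the residuals:
\begin{equation*}
(\hat a, \hat b) \in \arg\min_{a, b} \sum_{k=1}^{K} \big(\text{FID}_k - a\,\text{FLS}_k - b\big)^2, \qquad e_k := \text{FID}_k - \hat a\,\text{FLS}_k - \hat b .
\end{equation*}
Since a lower FID is better, under this convention a model with a large negative residual $e_k$ receives a better FID than its FLS predicts (it is ``overrated'' by FID), whereas a large positive residual marks a model that FID underrates.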

\section{Evaluation of Generative Models}
Given a training dataset $\mathcal{D}_{\text{train}} = \{\mathbf{x}_i\}^n_{i=1}$ drawn from a distribution $p_{\text{data}}$, one of the key objectives of generative modeling is to train a parametric model $g$ that is able to generate novel synthetic yet high-quality samples---i.e., the distribution $p_g$ induced by the generator is close to $p_{\text{data}}$.\footnote{By close we often refer to either a divergence between the two distributions (e.g. KL, JSD) or a distance metric like Wasserstein.}

While some lines of work evaluate generative models via their density, this is only possible when the exact density of the generative model under consideration can be computed efficiently \citep{song2021maximum}. Moreover, it has been argued that the likelihood of high-dimensional inputs has major flaws as an evaluation metric for generative models~\citep{theis2015note,nowozin2016f,le2021perfect}.

\looseness=-1
As all deep generative models of interest are capable of producing samples, an effective way to evaluate these models is via their samples. Such a strategy has the benefit of bypassing the need to compute the exact density of a sample point under the model, thus allowing for a unified setting to evaluate all models. More precisely, given generated samples ${\mathcal{D}}_{\text{gen}} = \{ \mathbf{x}^{\text{gen}}_i \}_{i=1}^m$, where each $\mathbf{x}^{\text{gen}}_i \sim p_g$, and test samples ${\mathcal{D}}_{\text{test}} = \{ \mathbf{x}^{\text{test}}_i \}_{i=1}^n$ drawn from $p_{\text{data}}$, the goal is to evaluate how ``good'' the generated samples are with respect to the real data distribution.




\subsection{Sample-based metrics} 
\looseness=-1
Historically, sample-based metrics for evaluating deep generative models have been based on two ideas: 1) using an Inception network \citep{szegedy2016rethinking} backbone $\varphi$ as a feature extractor to 2) compute a notion of distance (or similarity) between the generated and the real distribution. The Inception Score (IS) and the Fréchet Inception Distance (FID) are the two most popular examples and can be computed as follows:
\begin{align}
    &\text{IS:} \quad \exp\Big(\frac{1}{m} \sum_{i=1}^m \mathrm{KL}\big(p_\varphi(y| \mathbf{x}^{\text{gen}}_i)\,\|\,p_{d}(y)\big)\Big) \\
    &\text{FID:}\quad  \|\mu_g-\mu_p\|^2
    + \Tr(\Sigma_g + \Sigma_p - 2(\Sigma_g\Sigma_p)^{1/2})
\end{align}
where $p_\varphi(y|\mathbf{x})$ is the probability of each class given by the Inception network $\varphi$, $p_d(y)$ is the ratio of each class in the real data, $\mu_g := \frac{1}{m}\sum_{i=1}^m \varphi(\mathbf{x}^{\text{gen}}_i)$ and $\mu_p := \frac{1}{n} \sum_{i=1}^n \varphi(\mathbf{x}^{\text{test}}_i)$ are the empirical feature means of each distribution, and $\Sigma_g:= \frac{1}{m} \sum_{i=1}^m (\varphi(\mathbf{x}^{\text{gen}}_i) - \mu_g)(\varphi(\mathbf{x}^{\text{gen}}_i) -\mu_g)^\top$ and $\Sigma_p := \frac{1}{n}\sum_{i=1}^n (\varphi(\mathbf{x}^{\text{test}}_i) - \mu_p)(\varphi(\mathbf{x}^{\text{test}}_i)-\mu_p)^\top$ are the corresponding empirical covariances.
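
To build intuition for the FID expression, consider the special case (an assumption made here purely for exposition) in which the features are one-dimensional, so that $\Sigma_g = \sigma_g^2$ and $\Sigma_p = \sigma_p^2$ are scalars with $\sigma_g, \sigma_p \geq 0$. The trace term then simplifies as
\begin{equation*}
\sigma_g^2 + \sigma_p^2 - 2(\sigma_g^2\sigma_p^2)^{1/2} = (\sigma_g - \sigma_p)^2, \qquad \text{so that} \qquad \text{FID} = (\mu_g - \mu_p)^2 + (\sigma_g - \sigma_p)^2 .
\end{equation*}
In this case FID vanishes exactly when the two empirical means and standard deviations coincide, which makes explicit that FID compares only the first two moments of the feature distributions.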

\looseness=-1
The popularity of IS and FID as evaluation metrics for generative models is motivated by their correlation with perceptual quality and diversity. More recently, other metrics such as KID \citep{binkowski2018demystifying} (an unbiased alternative to FID) and precision/recall (which disentangles sample quality and distribution coverage)~\citep{sajjadi2018assessing} have followed in their footsteps and added nuance to generative model evaluation.

However, all of the above metrics share the same failure mode: they do not detect overfitting. In fact, a trivial generative model that merely memorizes the training set ${\mathcal{D}}_{\text{train}}$ would obtain a state-of-the-art score (within statistical uncertainty), regardless of whether it is compared to the training set or the test set. Indeed, the distance between the training set and the test set is negligible compared to the distance between generated samples and the data distribution.
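
To make this concrete for FID, consider a hypothetical memorizing generator that outputs ${\mathcal{D}}_{\text{gen}} = {\mathcal{D}}_{\text{train}}$ (the subscript ``train'' on the statistics below is notation introduced only for this illustration). Its feature statistics coincide with those of the training set, $\mu_g = \mu_{\text{train}}$ and $\Sigma_g = \Sigma_{\text{train}}$, so that
\begin{equation*}
\text{FID}({\mathcal{D}}_{\text{gen}}, {\mathcal{D}}_{\text{test}}) = \|\mu_{\text{train}}-\mu_p\|^2 + \Tr\big(\Sigma_{\text{train}} + \Sigma_p - 2(\Sigma_{\text{train}}\Sigma_p)^{1/2}\big).
\end{equation*}
Assuming the training and test sets are both drawn i.i.d.\ from $p_{\text{data}}$, this is simply the FID between two samples of the same distribution and is therefore small, while the same model evaluated against ${\mathcal{D}}_{\text{train}}$ attains $\text{FID} = 0$.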

\subsection{Feature Likelihood Score}
\label{sec:FLS}
We now introduce our Feature Likelihood Score (FLS), which is predicated on the belief that a proper evaluation measure for generative models should go beyond sample quality and also inform practitioners of the generalization capabilities of their trained models. While previous sample-based methods have forgone density estimation in favor of computing distances between sample statistics, we seek to bring back a likelihood-based approach to evaluating generative models. To do so, we first propose our method for overfitting a mixture of Gaussians (MoG). Through it, we can estimate the \emph{perceptual} density of high-dimensional samples in a way that accounts for \emph{overfitting}, i.e., samples that are closer to the training data than to the test data. Specifically, our method aims at attributing 1) a good density to high-quality, non-overfit images and 2) a very large density to images that have been memorized.


\subsubsection*{Overfitting Mixtures of Gaussians}
\looseness=-1
Our method consists of a simple sample-based density estimator that is amenable to a variety of data domains. It is inspired by a traditional mixture of Gaussians (MoG) density estimator, with a few key distinctions, and consists of the following steps:
\begin{enumerate}[itemsep=0mm]
    \item Map the samples to a chosen feature space.
    \item Initialize a mixture of Gaussians (MoG) using each mapped sample as a Gaussian.
    \item Select individual $\sigma^2_j$ for each Gaussian using a set of training samples (also in the feature space).
    \item Evaluate log-likelihood of MoG on the test set.
\end{enumerate}

\looseness=-1
\xhdr{Step 1: Map to feature space}
The first change we make is to map the inputs to some perceptually meaningful feature space. A natural choice for this is the representation space of the Inception-v3 network but, given recent criticisms~\citep{kynkaanniemi2022role}, we also experiment with CLIP features. While the resulting data is still high-dimensional (e.g., $d=2048$ or $d=512$), we ensure that a larger proportion of dimensions are useful and that the resulting $\ell_2$ distances between images are more meaningful.

\looseness=-1
\xhdr{Step 2: Model the density using an MoG} As in Kernel Density Estimation (KDE), to estimate a density from some set of points ${\mathcal{D}}_{\text{gen}} = \{\mathbf{x}^{\text{gen}}_j\}_{j=1}^m$, we center an isotropic Gaussian around each point (i.e., the mean of the Gaussian is given by the coordinates of the point). This means that the $j$-th data point has a Gaussian $\mathcal{N}( \varphi(\mathbf{x}_j^{\text{gen}}),\,\sigma_j^2 I_d)$.
Then, to compute the likelihood of a new point $\mathbf{x}$, we simply calculate the mean likelihood assigned to that point by all Gaussians in the mixture:
\begin{equation}
p_\sigma(\mathbf{x}) = \frac{1}{m} \sum_{j=1}^m \mathcal{N}( \varphi(\mathbf{x}) | \varphi(\mathbf{x}_j^{\text{gen}}),\,\sigma_j^2 I_d)
\label{eq:MoG_density}
\end{equation}
with the convention that $\mathcal{N}( \varphi(\mathbf{x}) | \varphi(\mathbf{x}^{\text{gen}}),\,0_d)$ is a Dirac at $\varphi(\mathbf{x}^{\text{gen}})$. Henceforth, we denote this MoG estimator, which has fixed centers initialized to a dataset ${\mathcal{D}}$ (e.g., train set, generated set), as ${\mathcal{N}}(\varphi({\mathcal{D}}); \Sigma)$, where $\Sigma$ is a diagonal matrix of bandwidth parameters, i.e., $\mathbf{\sigma}^2I$, where $\mathbf{\sigma}^2$ is a vector.
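
For reference, and as a sketch rather than a statement about any particular implementation, each component in~\eqref{eq:MoG_density} can be written out with the standard isotropic Gaussian density, and the log-density of a point then takes a log-sum-exp form (the shorthand $r_j(\mathbf{x})$ is introduced here only for illustration):
\begin{align*}
\mathcal{N}( \varphi(\mathbf{x}) | \varphi(\mathbf{x}_j^{\text{gen}}),\,\sigma_j^2 I_d) &= (2\pi\sigma_j^2)^{-d/2} \exp\Big(-\tfrac{r_j(\mathbf{x})^2}{2\sigma_j^2}\Big), \qquad r_j(\mathbf{x}) := \|\varphi(\mathbf{x}) - \varphi(\mathbf{x}_j^{\text{gen}})\|, \\
\log p_\sigma(\mathbf{x}) &= -\log m - \tfrac{d}{2}\log(2\pi) + \log \sum_{j=1}^m \exp\Big(-\tfrac{d}{2}\log \sigma_j^2 - \tfrac{r_j(\mathbf{x})^2}{2\sigma_j^2}\Big).
\end{align*}
Evaluating the last term with a numerically stable log-sum-exp is advisable in practice, since for $d$ in the hundreds or thousands the individual exponentials typically underflow in floating point.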

\looseness=-1
\xhdr{Step 3: Use the train set to select $\sigma^2_j$}
An important question in kernel density estimation is selecting an appropriate bandwidth $\sigma_j^2$. In most cases, a single bandwidth is selected, either through a statistical derivation or by minimizing some loss through cross-validation~\citep{murphy2012machine}. We depart from this single-bandwidth philosophy in favor of a separate $\sigma_j^2$ for each Gaussian. To select $\sigma_j^2$, instead of performing standard cross-validation on samples from $p_g$, we fit the bandwidths using a subset of training examples $\{\varphi(\mathbf{x}_{i}^{\text{train}})\}_{i=1}^n$ by minimizing their negative log-likelihood. Specifically, we solve the following optimization problem:
\begin{equation}
\hat \sigma^2 \in \arg\max_{\mathbf{\sigma}^2} 
\sum_{i=1}^{n} \log \sum_{j=1}^{m} 
 \sigma_j^{-d} \exp \Big({\tfrac{-||\varphi(\mathbf{x}^{\text{gen}}_j)-\varphi(\mathbf{x}^{\text{train}}_i)||^2}{2 \sigma_j^2}}\Big)
 \label{eq:cross_val}
\end{equation}
\looseness=-1
We motivate using a subset of $\{\mathbf{x}_{i}^{\text{train}}\}_{i=1}^n$ for bandwidth selection by the following observation: for each element of the training set that is copied by the generative model, the associated $\sigma_j^2$ vanishes. The following proposition (proof in \S\ref{app:proof}) formalizes this intuition.
\begin{mdframed}[style=MyFrame2]
\begin{restatable}{proposition}{propx}
\label{prop1}
For each $\mathbf{x}^{\text{gen}}_k \in \{\mathbf{x}_{i}^{\text{train}}\}_{i=1}^n$ we have that $\hat \sigma_k^2 = 0$ where $\hat \sigma^2$ is a solution of~\eqref{eq:cross_val}.
\end{restatable}
\end{mdframed}
\looseness=-1
Proposition \ref{prop1} implies that each element of the training set that has been memorized induces a Dirac in the MoG density~\eqref{eq:MoG_density}. Thus, the learned density is able to identify copying of training samples. More generally, if one of the generated samples is unreasonably close to a training sample, its associated $\sigma^2$ will be very small as this maximizes the likelihood of the training sample. We illustrate this phenomenon with the Two-Moons dataset~\citep{pedregosa2011scikit} in Figure~\ref{fig:density_fit}. Note that since this dataset is low-dimensional, we do not need to use a feature extractor (Step 1). In Figure~\ref{fig:density_fit} we can see that the more approximate copies of the training set appear in the generated set, the more the estimated density (using~\eqref{eq:cross_val}) contains high values around ``copies'' of the training set. As such, overfit generated samples yield an overfit MoG that does not model the distribution of real data $p_{\text{data}}$ and will yield poor (i.e., low) log-likelihood on the test set ${\mathcal{D}}_{\text{test}}$. 
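
The mechanism can be seen in a deliberately simplified setting (an assumption made here for illustration, not the general case treated by Proposition~\ref{prop1}): a single Gaussian ($m=1$) and a single training point ($n=1$) at feature distance $r := \|\varphi(\mathbf{x}^{\text{gen}}_1)-\varphi(\mathbf{x}^{\text{train}}_1)\| > 0$. The objective in~\eqref{eq:cross_val} then reduces to a function of one variable,
\begin{equation*}
\ell(\sigma^2) = -\tfrac{d}{2}\log \sigma^2 - \tfrac{r^2}{2\sigma^2}, \qquad \frac{\partial \ell}{\partial \sigma^2} = -\frac{d}{2\sigma^2} + \frac{r^2}{2\sigma^4} = 0 \iff \hat\sigma^2 = \frac{r^2}{d},
\end{equation*}
and this stationary point is the maximizer since $\ell \to -\infty$ both as $\sigma^2 \to 0$ and as $\sigma^2 \to \infty$. The selected bandwidth therefore shrinks quadratically with the distance to the training point and tends to $0$ as $r \to 0$, which is the Dirac limit described above.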
\looseness=-1
\begin{figure*}[ht]
    \centering
    \includegraphics[width=\linewidth, height=1.5in]{figures/MoG.pdf}
    \vspace{-8mm}
    \caption{Estimated density (in {\color{Violet}purple}) of the generated distribution using an MoG centered at the generated samples $\mathbf{x}^{\text{gen}}_i$ (in {\color{blue}blue})~\eqref{eq:MoG_density}. The selection of $\sigma_i^2$ is done via~\eqref{eq:cross_val}. The training points $\mathbf{x}^{\text{train}}_i\sim p_d$, sampled from the two-moons dataset, are represented in {\color{orange}orange}. The generated points correspond to $k$ approximate copies of the training set $\mathbf{x}^{\text{gen}}_i = \mathbf{x}^{\text{train}}_i + \mathcal{N}(0,10^{-4})\,,\, i=1,\ldots,k$ and $200-k$ independent samples from the data distribution $\mathbf{x}^{\text{gen}}_i \sim p_{d}, i=k+1,\ldots,200$. The dark areas correspond to high-density values.}
    \label{fig:density_fit}
    \vspace{-5pt}
\end{figure*}
\looseness=-1

\xhdr{Step 4: Evaluate MoG density} To get a quantitative evaluation of the density obtained in Step 3, we evaluate the likelihood of ${\mathcal{D}}_{\text{test}}$ under $p_{\hat \sigma}(\mathbf{x})$. As demonstrated in Figure~\ref{fig:density_fit}, in settings with $k > 0$, the generated samples are too close to the training set, meaning that test samples will have a low likelihood (as they are far from the centers of Gaussians with low variances). Evaluating the test set thus provides a succinct way of measuring the generalization performance of our generative model, which is a key aspect that is lost in metrics such as IS and FID.

It is important to note that while it is indeed possible to train any other density model, an MoG offers a favorable tradeoff: it is simple and scalable to large datasets while remaining highly interpretable, as we optimize a separate $\sigma^2_j$ for each Gaussian. Furthermore, an MoG density estimator is universal \citep{nguyen2020approximation}.

\cut{
\begin{algorithm}[tb]
   \label{alg:FLS}
   \caption{: Fit OGM}
    \begin{algorithmic}
       \STATE {\bfseries Input:} $x^{\text{train}}$,  $x^{\text{centers}}$, $\varphi$
       
       \STATE $x^{\text{train}} \leftarrow \varphi(x^{\text{train}})$
       \STATE $x^{\text{centers}} \leftarrow \varphi(x^{\text{centers}})$
       \STATE $\mathbf{\sigma^2} \leftarrow 1$

       \STATE // We compute and store the $N \times M$ matrix of pairwise L2 distances so it doesn't need to be recomputed during the optimization loop
       \STATE $\text{dists} \leftarrow d(x^{\text{train}}, x^{\text{centers}})$
       
       \STATE Initialize $OGM$ with $x^{\text{centers}}, \mathbf{\sigma^2}$
       
       \WHILE{$\mathbf{\sigma^2}$ not converged}
       \STATE // Gradient descent on the NLL of the train data
           \STATE $\mathbf{\sigma^2}\leftarrow \nabla_{\mathbf{\sigma^2}} OGM(x^{\text{train}})$
       \ENDWHILE
       
       \STATE // Return the fit $OGM$ and the final NLL of the train set
    \STATE \textbf{return} $OGM, OGM(x^{\text{train}})$
    \end{algorithmic}
\end{algorithm}
}
\begin{figure}[t]
\vspace{-5mm}
\begin{algorithm}[H]
   \small 
   \textbf{Inputs:} $ \hat{{\mathcal{D}}}_{\text{train}}, {\mathcal{D}}, \varphi, \alpha $
   \newline
   \xhdr{Train} Fit MoG on $\hat{{\mathcal{D}}}_{\text{train}}$ with  $\mu = \varphi({\mathcal{D}})$
    \begin{algorithmic}[1]
      
       \STATE $\Sigma = \mathbf{\sigma}^2I$ \hfill\COMMENT{// Initialize all bandwidths $\sigma^2_j = 1$}
       \STATE $ \mathbb{H} = d(\varphi(\hat{{\mathcal{D}}}_{\text{train}}), \varphi({\mathcal{D}}))$ \hfill\COMMENT{// Pre-compute distance matrix.}
       \WHILE{$\mathbf{\sigma}^2$ not converged}
       \STATE ${\mathcal{L}} = -\log {\mathcal{N}}(\varphi(\hat{{\mathcal{D}}}_{\text{train}}) | \varphi({\mathcal{D}}); \Sigma)$
           \STATE $\mathbf{\sigma}^2\leftarrow \mathbf{\sigma}^2 - \alpha \nabla_{\mathbf{\sigma}^2} {\mathcal{L}}$ \hfill\COMMENT{// Gradient descent on NLL }
       \ENDWHILE
       \STATE $\hat{\Sigma} \leftarrow \mathbf{\hat{\sigma}}^2 I$ \hfill\COMMENT{// $\mathbf{\hat{\sigma}}^2$ is the bandwidth at convergence}
   
    \STATE \textbf{return} ${\mathcal{N}} (\varphi({\mathcal{D}}); \hat{\Sigma})$ \hfill\COMMENT{// Return Trained MoG}
    \end{algorithmic}
    \caption{ Fitting MoGs for \textsc{FLS}}
    \label{alg:FLS}
\end{algorithm}
\vspace{-5mm}
\end{figure}
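
The gradient used in Algorithm~\ref{alg:FLS} can be written in closed form. As a sketch (the symbols $r_{ij}$ and $w_{ij}$ are shorthand introduced here, ${\mathcal{L}}$ is taken to be the summed negative log-likelihood of the $n$ training points, and additive constants that do not depend on $\mathbf{\sigma}^2$ are dropped as in~\eqref{eq:cross_val}), let $r_{ij} := \|\varphi(\mathbf{x}^{\text{gen}}_j)-\varphi(\mathbf{x}^{\text{train}}_i)\|$ and let $w_{ij}$ denote the responsibility of the $j$-th Gaussian for the $i$-th training point. Then
\begin{align*}
w_{ij} &:= \frac{\sigma_j^{-d} \exp\big(-r_{ij}^2 / (2\sigma_j^2)\big)}{\sum_{k=1}^m \sigma_k^{-d} \exp\big(-r_{ik}^2 / (2\sigma_k^2)\big)}, &
\frac{\partial {\mathcal{L}}}{\partial \sigma_j^2} &= \sum_{i=1}^n w_{ij} \Big( \frac{d}{2\sigma_j^2} - \frac{r_{ij}^2}{2\sigma_j^4} \Big).
\end{align*}
Setting this derivative to zero gives the (implicit, since $w_{ij}$ depends on $\mathbf{\sigma}^2$) stationarity condition $\sigma_j^2 = \sum_{i} w_{ij} r_{ij}^2 / \big(d \sum_{i} w_{ij}\big)$: each bandwidth equals the responsibility-weighted mean squared distance to the training points, divided by $d$. Only the distances $r_{ij}$ enter these expressions, which is why the distance matrix can be computed once before the optimization loop.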


\begin{figure}[ht]
\includegraphics[width=1.0\linewidth]{figures/FLS_flow_diagram.pdf}
\vspace{-15pt}
\caption{Computation procedure for FLS.}
\label{score_computation_both}
\vspace{-5pt}
\end{figure}

\subsubsection*{FLS Computation} 
Having established our method of density estimation for high-dimensional settings, we now detail how to use it to compute FLS.

FLS is designed to provide a single score that takes into account \textbf{sample quality} (IS, precision, etc.), \textbf{sample diversity} (FID, recall) while also punishing generative models that \textbf{overfit}. To compute our FLS score we use $3$ sets of samples: the training set (20000 samples), the test set (10000 samples), and a set of generated samples (10000 samples). 

Unless indicated otherwise, for our experiments we used the number of samples given in parentheses. We experimented with larger numbers of samples but this had a minimal impact on score values. We then map all the samples to the chosen feature space (Inception v3 or CLIP).\footnote{FLS can theoretically be used in any domain with any method for mapping inputs to a meaningful feature space. An investigation of this is beyond the scope of this paper.} Once mapped, we normalize the features to have zero mean and unit variance before computing our scores.


We start by taking the training set and splitting it randomly into two even subsets, ${\mathcal{D}}_{\text{train}} =  \hat{{\mathcal{D}}}_{\text{train}} \cup {\mathcal{D}}_{\text{baseline}}$. The first half $ \hat{{\mathcal{D}}}_{\text{train}}$ will be used to fit the MoG density estimator while the second half ${\mathcal{D}}_{\text{baseline}}$ will act as our baseline. Assuming that the training set and testing set are drawn i.i.d.\ from $p_{\text{data}}$, the second subset ${\mathcal{D}}_{\text{baseline}}$ serves as a reasonable proxy for samples generated by a perfect generative model. We then fit two MoG density estimators to $ \hat{{\mathcal{D}}}_{\text{train}}$. The first, ${\mathcal{N}}(\varphi({\mathcal{D}}_{\text{gen}}); \hat{\Sigma}_{\text{gen}})$, uses the generated samples as centers, while the second, ${\mathcal{N}}(\varphi({\mathcal{D}}_{\text{baseline}}); \hat{\Sigma}_{\text{baseline}})$, uses the second half of the train data as centers, i.e., $\varphi({\mathcal{D}}_{\text{baseline}})$. The training procedure for the MoG is outlined in Algorithm \ref{alg:FLS}. Finally, we evaluate the negative log-likelihood of ${\mathcal{D}}_{\text{test}}$ under both MoGs, which we denote as $nll_{\text{gen}} = -\log {\mathcal{N}}(\varphi({\mathcal{D}}_{\text{test}})|\varphi({\mathcal{D}}_{\text{gen}}); \hat{\Sigma}_{\text{gen}})$ and $nll_{\text{baseline}} = -\log {\mathcal{N}}(\varphi({\mathcal{D}}_{\text{test}})|\varphi({\mathcal{D}}_{\text{baseline}}); \hat{\Sigma}_{\text{baseline}})$, respectively. \textbf{FLS is then defined as:}
\begin{align}
   
    \text{FLS}({\mathcal{D}}_{\text{gen}}) &:=  \exp \left( 2 \frac{nll_{\text{baseline}} - nll_{\text{gen}}}{d} \right ) \times 100.
    \label{eqn:fls_overall}
\end{align}
where $d$ is the dimension of the feature space. This is equivalent to taking the $d$-th root of the (squared) likelihood ratio between the two MoGs. A visual depiction of the process is provided in Fig.~\ref{score_computation_both}. Intuitively, the score can be considered as a grade, with a value of 100 indicating a ``perfect'' generative model that does as well as a set of samples drawn from the data distribution. Lower values are indicative of problems in one or more of the three areas evaluated by FLS. Poor sample quality will lead to Gaussian centers that are far from the test set and thus to a lower likelihood. A failure to sufficiently cover the data manifold will lead to some test samples having very low likelihoods. Finally, overfitting to the training set will cause the MoG density estimator to overfit and yield a poor likelihood on the test set.
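
As a quick numerical illustration of~\eqref{eqn:fls_overall} (the figures below are hypothetical and chosen only to convey the scale of the score), take a feature space with $d = 2048$ and suppose that $nll_{\text{gen}}$ exceeds $nll_{\text{baseline}}$ by $100$ nats:
\begin{equation*}
\text{FLS} = \exp\Big( 2 \cdot \frac{-100}{2048} \Big) \times 100 = \exp(-0.09765625) \times 100 \approx 90.7 .
\end{equation*}
Equal values $nll_{\text{gen}} = nll_{\text{baseline}}$ give exactly $100$, and because of the division by $d$, the same gap in a feature space with $d = 512$ would give $\exp(-0.390625) \times 100 \approx 67.7$.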



\section{Introduction}

Data generation and simulation are some of the fastest-growing use cases of deep generative models, with success stories spanning the artificial intelligence spectrum \citep{karras2020analyzing, brown2020language, wu2021protein}. Despite the growth of applications, and unlike supervised or reinforcement learning, there is a lack of clear consensus on an evaluation protocol that is equally applicable to any generative modeling family. With the widespread adoption of deep generative models and the growing concerns regarding their data privacy~\citep{carlini2023extracting, hitaj2017deep}, especially in high-stakes and industrial production environments, it is critical not only to reliably evaluate the efficacy of generative models beyond sample quality and diversity but also to diagnose potential failure modes such as memorization.

\begin{figure}[h!]
\vspace{-10pt}
\centerline{\includegraphics[width=\linewidth]{figures/headline_plot.png}}
\vspace{-10pt}
\caption{FLS for a generative model that overfits (higher is better). (a) Generated samples are low quality and do not match the underlying Two Moons data $\implies$ low FLS. (b) Generated samples are high quality, diverse, and are not overfit to the training set $\implies$ high FLS. (c) Generated samples are still high quality and diverse \textbf{but too closely resemble the training set} $\implies$ lower FLS.}
\label{fig:headline_plot}
\end{figure}

Current approaches to evaluating generative models are either limited in their applicability to certain model classes (e.g., likelihood) \citep{van2021memorization} or are one-dimensional scores that miss important facets of evaluation \citep{xu2018empirical, esteban2017real, meehan2020non}. In particular, empirically observed phenomena such as memorization, overfitting, mode collapse, and mode dropping are often overlooked in favor of purely sample-quality-based metrics such as FID and Inception score. While there exists a body of work \citep{webster2019detecting, alaa2022faithful, meehan2020non} dedicated to identifying memorization and overfitting in generative models, these methods have problems of their own, such as being too dependent on nearest-neighbor computations or being unable to assess sample quality directly. Thus, at present, there is no evaluation measure capable of assessing both sample quality and the generalization capabilities of a generative model.


\xhdr{Current Work}
We propose a new sample-based score, the Feature Likelihood Score (FLS), that captures sample quality, is as scalable as popular sample-based metrics such as FID and Inception score and, crucially, is able to \textbf{diagnose overfitting}. Intuitively, FLS is derived from the commonly used likelihood evaluation protocol but is also able to assess the generalization performance of the generative model in a similar manner to most supervised learning setups. Evaluation using FLS has many consequential benefits:

\begin{enumerate}[noitemsep,topsep=0pt,parsep=0pt,partopsep=0pt, leftmargin=*]
    \item \textbf{Interpretability:} As scores improve, sample quality also improves, matching intuition and aiding interpretability.
    \item \textbf{Diagnosing Overfitting:} As overfitting begins (i.e., memorization of the training set), FLS reports an inferior score \emph{even without} any drop in sample quality and diversity.
    \item \textbf{Holistic Evaluation:} FLS provides a nuanced evaluation of generative models, revealing that some models are overvalued by the FID metric in their (generalization) performance, while many models that have slightly higher FID are better ranked by FLS.
    \item \textbf{Universal Applicability:} FLS is applicable to all generative models including popular methods like VAEs, Normalizing Flows, GANs, and Diffusion models with minimal overhead as it is computed using only samples. 
\end{enumerate}


\looseness=-1

At an intuitive level, FLS achieves this goal by revisiting the vanilla likelihood metric but in the feature space of a pre-trained Inception-v3 network~\citep{szegedy2016rethinking} or a pre-trained CLIP network~\citep{radford2021learning}. As most models lack explicit densities, we model the density of the generative model in the feature space by using a mixture of isotropic Gaussians (MoG) whose means are the features of the generated samples. Unlike conventional likelihood evaluation, FLS is purposely built to diagnose overfitting as it relies on estimating the (perceptual) likelihood of some held-out test data.

Specifically, the key step of our method is that the variance of each isotropic Gaussian is learned by maximizing the likelihood of the set of training examples that were used to train the generative model. Intuitively, the learned Gaussians collapse to Diracs whenever a generated sample is too close to a training example, as illustrated in Fig.~\ref{fig:density_fit}. We then take advantage of this overfitting by evaluating the log-likelihood of the test set. A model that produces overfit samples will yield an overfit MoG that will have a poor log-likelihood on the test set and hence be punished for its overfitting. 


We demonstrate the utility of FLS by conducting extensive experiments across a large variety of generative models for the datasets CIFAR10 \citep{krizhevsky2014cifar}, Imagenet \citep{deng2009imagenet}, and LSUN \citep{yu2015lsun}. In particular, we demonstrate that FLS is able to accurately detect overfitting behavior in synthetic settings where other metrics fail (\S\ref{sec:experiments}), while for natural image datasets we show that FLS is well correlated with popular sample-based metrics like FID, demonstrating that it is a good judge of sample quality. In addition, FLS is able to assess the generalization ability of these trained models both quantitatively and qualitatively, providing an estimate of the degree of overfitting and of which samples are the worst offenders. Finally, FLS sheds light on which existing pre-trained models are underrated or overrated by FID, enabling a quantitative appreciation for certain model classes such as DDPM \citep{ho2020denoising} while adding additional scrutiny to models like StyleGAN-XL \citep{sauer2022stylegan}.




\section{Related Work}
The prevalence of deep generative models and their impressive capacity to generate highly realistic data samples has led to the creation of many evaluation metrics. 

\xhdr{Likelihood evaluation}
The most common metric, and perhaps most natural, is the negative log-likelihood (NLL) of the test set, whenever it is easily computable. While appealing theoretically, in most cases, the generative model doesn't provide a density (e.g. GANs) or it is only possible to compute a lower bound of the test NLL (e.g. VAEs, continuous diffusion models, etc.) \citep{burda2015importance, song2021maximum,huang2021variational}. Even when possible, log-likelihood-based evaluation suffers from a variety of pitfalls in high dimensions and may often not correlate with higher sample quality \citep{nalisnick2018deep,le2021perfect}. Indeed many practitioners have empirically witnessed phenomena such as mode-dropping, mode-collapse, and overfitting \citep{yazici2020empirical} all of which are not easily captured simply through the NLL.

\looseness=-1
\xhdr{Sample-based evaluation}
Generative models can also be evaluated purely based on the quality and diversity of their generated samples with popular sample-based metrics such as the Inception score, FID or precision/recall \citep{salimans2016improved,heusel2017gans, sajjadi2018assessing}. However, a key limitation of these metrics is that they don't differentiate between good samples and overfit samples.

\looseness=-1
\xhdr{Overfitting evaluation}
Several approaches seek to provide metrics that are capable of detecting overfitting in generative models. These scores can again be categorized based on whether one can extract an exact likelihood \citep{van2021memorization} or a lower bound to it via annealed importance sampling \citep{wu2016quantitative}. 
For GANs, popular approaches include training an additional discriminator in a Wasserstein GAN that does not provide gradients to the generator \citep{adlam2019investigating} and adding a memorization score to the FID metric \citep{bai2021training}.
Alternate approaches to detect overfitting seek to find real data samples that are closest to generated samples via membership attacks \citep{liu2018generative, webster2019detecting}. Recently, non-parametric tests have been employed to detect memorization or exact data copying in generative models \citep{xu2018empirical, esteban2017real, meehan2020non}. Parametric approaches to detect data copying have also been explored such as using neural network divergences \citep{gulrajani2020towards} or using latent recovery \citep{webster2019detecting}. Finally, \citet{alaa2022faithful} propose a multi-faceted metric with a binary sample-wise test to determine if a sample is authentic or not (i.e. overfit).

