
%\vspace{-2em}
\section{Introduction}
%\vspace{-1em}
Over the past decade, significant progress has been achieved in enhancing implicit generative models (GANs \citep{goodfellow2014generative}, VAEs \citep{kingma2022autoencoding}, and diffusion models \citep{sohldickstein2015deep, ho2020denoising}), leading to their extensive use in diverse domains like image and graph generation. In the image generation domain, efforts have been made to standardize evaluation metrics \citep{Wang2004ImageQA, Zhang2018TheUE, NIPS2017_8a1d6947} for comparing the effectiveness of different implicit generative models. However, the graph generation domain has yet to adopt a similar standardization. Moreover, while visual inspection of an image can reveal much about its semantic characteristics, this cannot be applied to graphs.
Perhaps more importantly, graph generative models are applied quite differently from image generators.
Rather than aiming to generate samples that resemble existing ones, most graph generative models are designed to generate novel yet interesting graphs, e.g., new molecules with specific properties.

While extrapolating far from the known distribution of graphs is indeed challenging, there is potential for generative models to explore novel graphs within underexplored regions of the graph space by leveraging patterns observed in existing graphs. We illustrate this concept and our proposed evaluation methodology in \autoref{fig:shifted-split-illustration}, using molecules as an example. Thick support regions represent known molecules, while thin support regions denote the space of novel graphs. We note that, unlike this toy 2D illustration, real graph distributions (like image distributions) are expected to have many areas of thin support in high dimensions, though these areas may be difficult to identify or characterize.
Thus, the question arises: \emph{How can we measure a graph generative model's ability to generate novel graphs on thin support regions?}
%How can we properly evaluate the generalizability of graph generative models?
\begin{figure}[!ht]
    \centering
    \includegraphics[width=0.9\columnwidth]{figures/shifted-split-illustration.png}
    \caption{
    \name systematically thins the distribution in a certain region for training (top row) and then evaluates whether the generated samples in the thinned region, after reweighting, match the complementary held-out test dataset (bottom row).
    In contrast, standard evaluations seek to match the macro properties (e.g., mean) of this distribution, which emphasizes the regions of thick support.
    The original data (left) illustrates both thick support regions (i.e., areas with many samples) and thin support regions (i.e., areas with very few samples).
    }
    \label{fig:shifted-split-illustration}
\end{figure}



The most intuitive and potentially ideal evaluation approach would involve computing the negative log-likelihood on a test dataset. This metric, relying on the KL divergence, is inherently sensitive to thin support regions. However, for modern implicit generative models, log-likelihood is difficult to compute exactly or even approximate well.
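
To make the sensitivity to thin support explicit, recall the standard identity relating the expected negative log-likelihood to the KL divergence. The notation here is introduced only for illustration: $p$ denotes the true data distribution, $q$ the model distribution, and $H(p)$ the entropy of $p$.
\[
    \mathrm{E}_{x \sim p}\left[-\log q(x)\right] = \mathrm{KL}(p \,\|\, q) + H(p).
\]
Since $H(p)$ does not depend on the model, ranking models by test negative log-likelihood is equivalent to ranking them by $\mathrm{KL}(p \,\|\, q)$. The term $-\log q(x)$ grows without bound as $q(x) \to 0$, so a model that assigns almost no mass to a region where $p$ has even thin support is heavily penalized.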

Given these challenges, most recent evaluations of generative models seek to compare statistics between generated samples and a held-out test set. 
A simple approach is to merely compare the means of these distributions or the means of various graph properties.
Extending the difference in means to the worst-case difference in the expectation of a function, taken over a class of functions, yields the Maximum Mean Discrepancy (MMD).
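For reference, one standard way to write this quantity is given below. The notation is introduced here for illustration and is not fixed by the rest of this section: $P$ is the data distribution, $Q$ the model distribution, and $\mathcal{F}$ a class of test functions, typically the unit ball of a reproducing kernel Hilbert space.
\[
    \mathrm{MMD}(P, Q; \mathcal{F}) = \sup_{f \in \mathcal{F}} \left( \mathrm{E}_{x \sim P}\left[f(x)\right] - \mathrm{E}_{y \sim Q}\left[f(y)\right] \right).
\]
Restricting $\mathcal{F}$ to unit-norm linear functions recovers the norm of the difference in means.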
The most commonly used procedure for evaluating graph generative models is to compute the MMD between the generated samples and a held-out set for the degree, clustering coefficient, and orbit count distributions \citep{Niu2020PermutationIG, Chen2021OrderMP, Liao2019EfficientGG, pmlr-v162-hoogeboom22a, Vignac2022DiGressDD}.
However, these mean-based approaches focus on the regions of thick support where the most mass is.
Thus, they can fail to detect a generative model's performance on the thin support regions, the exact regions where novel graphs could exist.
We illustrate this problem in more detail in \autoref{sec:appA}.
\iffalse
We illustrate this problem in\autoref{fig:metric-illustration} where we show that using cross-validation with the difference in means or the more complex Wasserstein-1 distance\footnote{Wasserstein-1 distance is also an integral probability metric like MMD but uses a different class of functions in the optimization problem. Wasserstein-1 was chosen for this illustration because it can be computed efficiently and has no hyperparameters.} does not provide useful signal for selecting a model whereas negative log-likelihood provides a strong signal.
In summary, prior evaluation approaches for implicit generative models are limited in their ability to measure performance on thin support regions.

\begin{figure}[!ht]
    \centering
    \includegraphics[width=\columnwidth]{figures/relative-density.png}
    \includegraphics[width=\columnwidth]{figures/metric-illustration.png}
    \caption{
    While using standard cross-validation with mean difference or Wasserstein-1 metrics does not provide reliable signal to select the right model, our vertical validation (VV) method with mean difference or Wasserstein-1 provides reliable information on the best model and matches the ranking of the negative log-likelihood.
    This illustration uses the 2D dataset from \autoref{fig:shifted-split-illustration} as ground truth, and the relative density of the ``thin region'' is varied to represent different estimated models (top).
    For both standard validation and vertical validation, we use 10 folds and 30 repetitions and show the standard deviation for each method.
    }
    \label{fig:metric-illustration}
\end{figure}
\fi


To address this evaluation gap, we focus on comparing statistics (e.g., the Kolmogorov-Smirnov (KS) statistic \citep{KolmogorovSmirnov1933SullaDE}) on systematically constructed thin regions of support, as illustrated in \autoref{fig:shifted-split-illustration}.
Inspired by the classic train-test split idea, we develop a novel method to ``vertically'' split the graph dataset into train and test datasets depending on one graph property.
Then, after training, we reweight generated samples and compare them to the corresponding held-out test dataset.
At a high level, our evaluation approach, called Vertical Validation (VV), artificially simulates a thin region, but then has ground truth samples from this thinned region to compare against.
After reweighting, any metric that can handle weights could be used to compare the generated samples to the held-out samples.
We choose the KS statistic averaged across graph property distributions, though other metrics could be used within our framework.
This procedure enables the evaluation of the generation capabilities in localized thin support regions rather than focusing on the thick support regions.
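As a rough sketch of how a weighted comparison of this kind can be written for a single scalar graph property, consider the following. All symbols are assumptions made for illustration, and the precise definition used by the method may differ: $x_1, \dots, x_n$ are property values of generated graphs with nonnegative weights $w_1, \dots, w_n$, and $y_1, \dots, y_m$ are property values of held-out test graphs.
\[
    \hat{F}_w(t) = \frac{\sum_{i=1}^{n} w_i \, \mathbf{1}[x_i \le t]}{\sum_{i=1}^{n} w_i},
    \qquad
    \hat{G}(t) = \frac{1}{m} \sum_{j=1}^{m} \mathbf{1}[y_j \le t],
    \qquad
    D = \sup_{t} \left| \hat{F}_w(t) - \hat{G}(t) \right|.
\]
With uniform weights, $D$ reduces to the usual two-sample KS statistic. Averaging $D$ over several graph properties gives a single score of the kind described above.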
%In \autoref{fig:metric-illustration}, we show that our \name approach provides reliable signal for selecting the correct model in this toy example.
%However, our vertical cross-validation approach (VV) and negative log-likelihood provide reliable signal for selecting the correct model that will generate well on the thin support regions.
We summarize our contributions as follows:
\begin{enumerate}
    \item We develop a novel ``vertical'' train-test splitting approach that systematically creates thin support in the training data while the testing data has thick support in this region. This can be applied to arbitrary 1D distributions and includes two hyperparameters that control the split sharpness and thickness of full support.
    \item We combine this split procedure with a reweighting step to form a novel methodology for evaluating the ability to generate data in thin support regions. We prove that this metric instantiated with the KS statistic is consistent.
    \item We empirically validate our \name approach for model selection in the thin support regime of synthetic datasets and then apply \name to compare representative graph generative models on two popular graph datasets.
    
\end{enumerate}





