
\section{Background and Related Work}
\label{sec:related_work}

\paragraph{Evaluating Graph Generative Models}
Several methods have been used to evaluate the performance of graph generative models.
Some apply to all graph types. These include novelty, uniqueness, the Wasserstein distance between generated samples and a held-out set, and the Maximum Mean Discrepancy (MMD) between the degree, clustering coefficient, and orbit count distributions of the generated samples and those of a held-out set, as used by \citet{LiaoGRANs, martinkus2022spectre, Vignac2022DiGressDD, pmlr-v162-hoogeboom22a}.

On the other hand, some metrics are specific to molecular generation tasks, such as the Fr\'echet ChemNet Distance (FCD) introduced by \citet{Preuer2018FrchetCD}, or the Neighbourhood Subgraph Pairwise Distance Kernel (NSPDK) MMD, which uses the kernel introduced by \citet{Costa2010FastNS}. Other metrics include atom stability, molecule stability, and the validity of generated molecules, as used by \citet{pmlr-v162-hoogeboom22a, Vignac2022DiGressDD} and others.

As one critique of prior evaluations, \citet{o'bray2022evaluation} observed that MMD-based metrics are sensitive to the choice of kernel function, the kernel parameters, and the parameters of the descriptor function. \citet{thompson2022evaluation} also noted that current evaluation methods do not accurately capture the diversity of the generated samples,
which led them to propose an approach that computes metrics on graph embeddings produced by GIN \citep{Xu2018HowPA} to better capture diversity.
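
For concreteness, the dependence on the kernel can be read off the standard definition of the squared MMD. The notation below is ours and is meant only as an illustrative sketch: $P$ and $Q$ denote the distributions of a graph descriptor (e.g., a degree histogram) over generated and held-out graphs, $k$ is a kernel on descriptors, and the Gaussian kernel with bandwidth $\sigma$ is one common choice, not necessarily the one used in the works cited above.
\[
\mathrm{MMD}^2(P, Q) = \mathbb{E}_{x, x' \sim P}\left[k(x, x')\right] - 2\,\mathbb{E}_{x \sim P,\, y \sim Q}\left[k(x, y)\right] + \mathbb{E}_{y, y' \sim Q}\left[k(y, y')\right],
\]
\[
k(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right).
\]
Both the kernel $k$ and its bandwidth $\sigma$ enter every term, so changing either can change the resulting value, and potentially the ranking of models.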
 
\citet{southern2023curvature} recently proposed curvature descriptors and topological data analysis as a more robust and expressive metric for evaluating graph generative models, but did not specifically consider thin support regions.
Despite this progress, current metrics still have deficiencies, particularly when it comes to measuring the ability of a model to generate data in thin support regions.
\paragraph{Related Train-Test Validation Methods}

While classic cross-validation methods draw i.i.d. splits \citep{Arlot2009ASO}, our approach creates splits that are nearly out-of-distribution, so that there is a distribution shift between the training and test sets.
Evaluating models under distribution shift has been studied for supervised learning under the names domain adaptation (DA) \citep{Farahani2020ABR} and domain generalization (DG) \citep{Koh2020WILDSAB}. In both cases, the accuracy metric is evaluated on a test distribution that differs from the training distribution. However, both DA and DG primarily consider supervised learning tasks, whereas we consider generative models. Thus, our approach can be viewed as a type of distribution shift evaluation for generative models.

In a similar vein, \citet{bazhenov2023evaluating} propose a method for splitting the \emph{nodes} of a graph into in-distribution and out-of-distribution nodes based on structural properties.
This splitting enables the evaluation of node-level prediction tasks under distribution shifts.
Our work differs in two ways: we split at the level of graphs rather than nodes, and we aim to evaluate generative models rather than node-level prediction tasks.
