\section{Experiments}
In our experiments, we first demonstrate the effectiveness of our method in measuring the generalization of graph generative models. Once this foundation is established, we proceed to compare different graph generative models using our approach. 

\subsection{Validating the Effectiveness of \name} 
\label{sec:exp_val}
For the experiments in this section, we focus on one split property $\ell$ and one split index $j$ in \name, similar to concentrating on a single fold in the traditional CV scheme, which we will denote as HV for Horizontal Validation.
 We also opt to use data with a known distribution, which gives us access to the ground truth for generating additional samples when necessary. We generate 500 Erdős-Rényi random graphs \citep{ERGraphs} with $n_{\textnormal{nodes}} = 20$ and $p=0.5$. Additionally, we generate 500 samples of the community dataset (which we refer to as Comm20), consisting of two equally sized communities with 6 to 10 nodes per community (so that $n_{\textnormal{nodes}}$ ranges between 12 and 20 in total). The probability of an edge within each community is $p=0.7$, while the probability of an edge across communities is a function of the number of nodes within a community, $p_{int} = \frac{0.1}{n_{\textnormal{nodes}}/2}$.
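
As a quick sanity check on the Comm20 parameters, and reading the denominator of $p_{int}$ as the per-community size $c = n_{\textnormal{nodes}}/2$ (the symbol $c$ is introduced here only for illustration, and this reading is our interpretation of the formula), the two communities share $c^2$ possible cross edges, so the expected number of cross-community edges is
\[
c^2 \cdot p_{int} = c^2 \cdot \frac{0.1}{c} = 0.1\,c, \qquad \text{with } p_{int} = 0.1/6 \approx 0.017 \text{ for } c = 6 \text{ and } p_{int} = 0.01 \text{ for } c = 10 .
\]
Under this reading, a Comm20 graph contains between $0.6$ and $1.0$ cross-community edges in expectation, so the two communities are only sparsely connected to each other.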


In our experiments, we explore five possible model types: 1) \textbf{E.Memo}, Exact Memorization, represents a model that memorizes the original training data and then samples ``new'' graphs only by bootstrapping from it; 2) \textbf{A.Memo}, Approximate Memorization, represents a model that memorizes the training data but introduces subtle variations during sampling by adding or removing an edge from graphs in the memorized data;
3) \textbf{Oracle} represents a model that is capable of generating data directly from the ground truth distribution (for Erdős-Rényi graphs, the parameters of such a model are $n_{\textnormal{nodes}} = 20$ and $p = 0.5$; for Comm20, they are $n_{\textnormal{nodes}}$ between 12 and 20 and $p = 0.7$), which serves as the benchmark for the ideal generative model; 4) \textbf{Close} represents a model that is able to generate data from a distribution that is somewhat ``close'' to the ground truth distribution (generating new graphs for Erdős-Rényi with $n_{\textnormal{nodes}} = 20$ and $p = 0.45$, and for Comm20 with $n_{\textnormal{nodes}}$ between 12 and 20 and $p = 0.65$); 5) \textbf{Far} represents a model that simulates underfitting, as it generates data from a distribution that is not close enough to the ground truth (generating new graphs for Erdős-Rényi with $n_{\textnormal{nodes}} = 20$ and $p=0.4$, and for Comm20 with $n_{\textnormal{nodes}}$ between 12 and 20 and $p = 0.6$).
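
To give a sense of how far the Close and Far models are from the Oracle on the Erdős-Rényi data, recall the standard fact that a node in a $G(n, p)$ random graph has expected degree $(n-1)\,p$ (a textbook property of the model, not a measurement from our experiments). With $n = 20$ nodes:
\[
\underbrace{19 \cdot 0.5 = 9.5}_{\text{Oracle}}, \qquad
\underbrace{19 \cdot 0.45 = 8.55}_{\text{Close}}, \qquad
\underbrace{19 \cdot 0.4 = 7.6}_{\text{Far}} .
\]
The expected average degree therefore drops by $0.95$ for Close and by $1.9$ for Far relative to the Oracle.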


 To illustrate the usefulness of our \name approach in model selection, we use \name with $\epsilon = 0.01$, $\ell = 2$ (corresponding to the split property: number of triads), $\sharpness = 10$, and $k = 5$ to create a train-test split. We hold out the last split, corresponding to $j = 5$, and refer to it as v-test; the parts corresponding to $j = 1, 2, 3, 4$ are referred to as train. Next, we use \name again to further split the train part, this time with $\epsilon = 0.01$, $\ell = 2$, $\sharpness = 10$, and $k = 4$, to create a train-val split. We again hold out the last split, corresponding to $j = 4$, and refer to it as v-val, while the splits $j = 1, 2, 3$ are referred to as v-train. In an alternate setting, we apply HV (which we can achieve by setting our parameters to $\epsilon = 1$ and $\sharpness = 1$) to the same train part to create train-val folds; since these folds are all similar, we label one of them h-val and the rest h-train. The general idea is that v-test is a region we are interested in but do not have access to. Traditionally, we would use HV and splits like h-train/h-val to choose the best model. However, we argue that this approach is not ideal if the goal is a model capable of generalizing to areas of thin support, and that a split like v-train/v-val can help us choose the best model for that task. To illustrate this, we train the five models introduced above on the Erdős-Rényi and Comm20 datasets. Each model is trained and tested on train/v-test, v-train/v-val, and h-train/h-val (these splits are illustrated in \autoref{fig:splitting} in \autoref{sec:splitting}). We then report the averaged KS scores on the average degree ($\phi_{ks}^D$) and the average clustering coefficient ($\phi_{ks}^C$) properties for all models and split types in \autoref{fig:synthetic-vertical-validation}, where several trends are visible.
First, in the results on h-val, we see that E.Memo and A.Memo have lower KS values than the Oracle. This can be explained by the fact that both of these models care only about in-distribution performance, and as such they can achieve lower values than the Oracle, which would have been the best model had we not initially held out v-test; i.e., the Oracle produces data covering all of the support, but h-val covers only part of that support, and hence HV views its performance as inferior. Second, we notice that the Oracle is indeed regarded as the superior model in both of our cases (v-val and v-test), in line with our goal of giving higher rankings to models that generalize better. Third, as a side effect of our setting, we are also able to detect that E.Memo and A.Memo are inferior to the Oracle, since they are not able to generalize.
 To summarize, if a user were interested in a model capable of generalizing to regions of thin support, then the conventional HV setting would be misleading for model selection, while a \name setting would be preferable.


\begin{figure*}[!ht]
    \centering
    \begin{subfigure}[t]{\columnwidth}
    \centering
        \includegraphics[width=0.75\textwidth]{figures/erdos_plot.pdf}%erdos-vertical-validation.png}
        \vspace{-1em}
        \caption{Results for Erdős-Rényi graphs}
        
    \end{subfigure}
    \begin{subfigure}[t]{\columnwidth}
    \centering
        \includegraphics[width=0.75\textwidth]{figures/comm_plot.pdf}%comm20-vertical-validation.png}
        \vspace{-1em}
        \caption{Results for Comm20 graphs}
        \vspace{-1em}
    \end{subfigure}
    \caption{On both the Erdős-Rényi and Comm20 datasets, our proposed vertical validation approach (VV) can select the best model for generating in thin support, as shown by the test line, whereas the standard train-test splitting (CV) tends to favor memorization despite poor generalization to the thin support regions.
    Note that for CV, the oracle distribution seems worse than memorization because the oracle generates from the true distribution rather than the shifted training distribution; thus, it appears to CV that memorizing is actually the better option. This phenomenon does not happen in our validation approach because we aim to find a model that generalizes well to the thin support.
    This also showcases that our approach is better able to detect memorization than the standard train-test split validation.
    }
    \label{fig:synthetic-vertical-validation}

\end{figure*}





\subsection{Comparing Models with \name}
\subsubsection{Using \name in a Train-Val-Test context}
\label{sec:vtrain-val-vtest}
Building on the empirical validation of our metric presented in the previous section, we now proceed to compare real graph generative models on realistic datasets using our \name method. We have implemented our approach in a manner consistent with the validation experiments described in \autoref{sec:exp_val}. However, in this section, the different models correspond to various representative graph generative models. 
%Additionally, we have expanded the properties of interest to five properties, i.e., $m = 5$. % this is later when talking about qm9

\paragraph{Models and Datasets} We select two representative graph generative models for comparison: DiGress \citep{Vignac2022DiGressDD}, a discrete diffusion-based model that introduces noise to graphs and then trains a graph transformer to revert the process, and  GDSS \citep{Jo2022ScorebasedGM}, a score-based diffusion model employing stochastic differential equations (SDEs) to generate node, edge attributes, and adjacency matrices jointly.
%; and GGAN \citep{krawczuk2021gggan}, a GAN-based model that employs adversarial training to generate graphs. 
For this experiment, we choose Qm9, a commonly used molecular dataset of 130,831 small molecules. We selected five properties (i.e., $m=5$) for Qm9: average degree, molecular weight (Mlwt), Topological Polar Surface Area (TPSA) \citep{Prasanna2009TopologicalPS}, ring counts, and the logarithm of the partition coefficient (LogP).
We preprocessed the dataset by removing hydrogen atoms and filtering out molecules for which any of the five properties could not be calculated using the RDKit package.
Additional experimental details are outlined in \autoref{sec:add_exp}.

\paragraph{Model Comparison Experiment}
We aim to judge the performance of different generative models by their ability to generalize. To accomplish this, we use TPSA as the split property (which corresponds to index $\ell=3$) with parameters $\sharpness = 10$, $\epsilon = 0.01$, and $k = 5$, and hold out the split corresponding to $j = 5$ as the v-test portion.
Then, we further split the data with $k = 4$ and hold out the split corresponding to $j = 4$ as the v-val portion. The remaining splits $j = 1,2,3$ correspond to v-train and are used for training the model.  We then evaluate the performance of these models with respect to v-val and v-test for each model type %and for both datasets 
by generating samples from the trained models such that the effective number of samples is  
$n_{\text{eff}} = \min(1000,|\Xtest^{(\ell,j)}|)$. Finally, we calculate the $\phi_{KS}$ scores on the five properties of interest, excluding the cases in which the test property is the same as the split property (i.e., we keep only $\ell' \neq \ell$).



We see in \autoref{tab:compare_models_qm9_2} that our VV method using v-val is able to correctly select the model which will perform best on thin support, i.e., perform the best on the held-out v-test.
In most cases, GDSS is better on both v-val and v-test, but in the case of LogP, DiGress is better on both v-val and v-test.

Both models seem to struggle with molecular weight, suggesting that generalizing over that property is inherently difficult.
This result showcases that using VV can properly select between model classes when generalization on thin support is desired---and this selection may depend on the test property.
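
The $\phiavg$ column of \autoref{tab:compare_models_qm9_2} can be reproduced from the four per-property scores (TPSA is excluded because it is the split property). Taking the GDSS rows as a worked example:
\[
\text{v-val: } \frac{0.053 + 0.343 + 0.053 + 0.083}{4} = 0.133, \qquad
\text{v-test: } \frac{0.058 + 0.444 + 0.058 + 0.123}{4} \approx 0.171 .
\]
The same computation gives approximately $0.248$ and $0.274$ for DiGress, so the ranking by $\phiavg$ on v-val agrees with the ranking on v-test.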



\begin{table}[t]

%\parbox{0.85\linewidth}{
\centering
\caption{$\phi_{KS}$ values for different test properties on Qm9 when compared against v-val and v-test, where $\phideg$, $\phimwt$, $\phirc$, $\philogp$, and $\phiavg$ denote the scores for average degree, molecular weight, average ring counts, average LogP, and the average over all of these properties, respectively.}
\resizebox{\linewidth}{!}{
\label{tab:compare_models_qm9_2}
\vspace{-1em}
\begin{tabular}{p{0.04\linewidth}|p{0.14\linewidth}|p{0.14\linewidth}p{0.14\linewidth}p{0.14\linewidth}p{0.14\linewidth}|p{0.14\linewidth}}
    
     & Model & $\phideg$ & $\phimwt$ & $\phirc$ & $\philogp$ & $\phiavg$ \\
     \hline \\[-2ex]
     
     \multirow{2}{*}{\rotatebox[origin=c]{90}{\parbox[c]{0cm}{\centering \normalsize{v-val}}}}
     &DiGress& 0.206 & 0.531 & 0.206 & \textbf{0.048} & 0.248 \\
     &GDSS& \textbf{0.053} & \textbf{0.343} & \textbf{0.053} & 0.083 & \textbf{0.133} \\
     \hline  \\[-2ex]
          \multirow{2}{*}{\rotatebox[origin=c]{90}{\parbox[c]{0cm}{\centering \normalsize{v-test}}}}
     &DiGress& 0.216 & 0.568 & 0.204 & \textbf{0.109}& 0.274 \\
     &GDSS& \textbf{0.058} & \textbf{0.444} & \textbf{0.058} & 0.123 & \textbf{0.171}\\
    
\end{tabular}
}

\end{table}

 
\vspace{-0.8em}
\paragraph{Exploratory Visualizations:} To better explore the properties of our metric, we took the samples generated in the experiment above, sorted them in descending order according to the weights assigned by our approach, and then filtered for validity and novelty to obtain the 100 top-weighted molecules. We then visualized the top four of these molecules generated by DiGress when testing against v-test in \autoref{fig:digress_vtest}, and visualized the rest of the top generated molecules in \autoref{sec:add_exp}. Furthermore, in \autoref{fig:dist_val_test}, we indicate the value of the split property (TPSA) of those four molecules and their location with respect to the entire distribution of the TPSA property. As expected, the samples with the higher weights tend to come from the region that was held out.

\begin{figure}[!ht]
    \centering
     \includegraphics[width=.7\linewidth]{figures/digress_test.png}  
    \caption{Examples of molecules generated by DiGress. These are the top four molecules (after filtering for validity and novelty) according to the weights assigned by our method when using v-test as the held-out portion.
    }
    \label{fig:digress_vtest}
    \vspace{-0.5em}
\end{figure}


\begin{figure*}[!ht]
    \centering
    \begin{subfigure}[t]{0.8\columnwidth}
    \centering
        \includegraphics[width=\textwidth]{figures/ecdf_val_2.png}%erdos-vertical-
    \end{subfigure}
    \raisebox{0.5cm}{ 
    \begin{subfigure}[t]{0.4\columnwidth}
    \centering
        \includegraphics[width=\textwidth]{figures/legend.png}%erdos-vertical-validation.png}
    \end{subfigure}
    }
    \begin{subfigure}[t]{0.8\columnwidth}
    \centering
        \includegraphics[width=\textwidth]{figures/ecdf_test_2.png}%comm20-vertical-
    \end{subfigure}
    \caption{Both figures show the distribution of the TPSA property for v-train/v-val/v-test. As expected, the four highest-weighted molecules after filtering for validity and novelty lie in high-density regions of v-val (left) and v-test (right). The distribution of v-train is shown in blue and is overlaid with the distribution of the v-val portion in yellow and of v-test in green. We plot where our top molecules lie with respect to their calculated TPSA values (the y-axis value is not meaningful for the molecules; we varied it across molecules for ease of visualization). We use the first letter to signify which model generated the sample: D for \textcolor{red}{DiGress} and G for \textcolor{blue}{GDSS}.
    }
    \label{fig:dist_val_test}
    \vspace{-1.2em}
\end{figure*}


\subsubsection{Using \name in a Cross-Validation-Like Context}
\label{sec:scv}

In this section, we train exhaustively on all the splits from the corresponding split properties and split indices. This approach is similar in spirit to cross-validation, but rather than having only $k$ folds, we have $k\times m$ folds, one for each combination of split index $j \in \{1,\dots,k\}$ and split feature $\ell \in \{1,\dots,m\}$. For our experiments, we chose $k=4$ and $m=5$, resulting in a total of 20 different data splits. We trained each model on each of our splits separately and generated samples from these trained models
such that the effective number of samples is 
$n_{\text{eff}} = \min(1000,|\Xtest^{(\ell,j)}|)$. Finally, we evaluated the performance of these models on the five properties of interest, excluding the cases in which the test property is the same as the split property (i.e., we keep only $\ell' \neq \ell$).

In addition to the representative models mentioned in \autoref{sec:vtrain-val-vtest}, we also include GGAN \citep{krawczuk2021gggan}, a GAN-based model that employs adversarial training to generate graphs. This model is suitable only for non-molecular datasets, as node properties are not easily incorporated. Thus, we evaluate all three models (DiGress, GDSS, and GGAN) on a variation of the previously used community dataset, which we describe in more detail in \autoref{sec:add_exp} and refer to as the Comm dataset, and we evaluate DiGress and GDSS on Qm9 again in the current context. We elaborate on the results for the Comm dataset below, and also present the results on Qm9 in \autoref{tab:compare_models_qm9} for completeness. For the results on the Comm dataset, we aggregate according to different use cases depending on the user's interest.

%The results are adaptable and customizable to suit the user's specific interests, reducing the need to train all 20 different models. We present multiple use cases to showcase the versatility of our metric. 

\emph{Use Case 1}: The user seeks the best model for overall performance across all predefined $m$ properties. We calculate the average $\phi_{KS}$ across all combinations of $\ell$, $j$, and $\ell'$, with $\ell \neq \ell'$, and report a single averaged value per model, with lower values indicating better generalization capabilities. To demonstrate this case, we report the overall average performance: $0.567$ for DiGress, $0.51$ for GDSS, and $0.17$ for GGAN. From these numbers, GGAN performed the best, followed by GDSS, with DiGress falling slightly behind.
%old nums digress $0.548$...$0.502$ ....$$0.449$
It can also be useful to consider more specific use cases rather than this broad one.

\emph{Use Case 2}: The user aims to identify a model that excels in generalizing over a specific property of interest. We can compute separate average $\phi_{KS}$ values for each test property $\ell'$ by averaging across all ($\ell$,$j$) splits such that $\ell' \neq \ell$. This enables the user to make decisions based on the performance on their specific property of interest. We report the results of this use case in \autoref{tab:compare_models}. Based on the results, GGAN generally outperforms the other models, GDSS occupies the middle ground, and DiGress consistently falls behind.
To further understand this behaviour, we examined the ECDFs of the generated properties for the models and compared them to the ECDFs of the held-out data (see \autoref{sec:add_exp} for a list of figures). We noticed an overall trend in which the original (held-out) data tends to have pronounced discontinuities, a characteristic that GGAN tends to replicate with fewer modes. In contrast, DiGress and GDSS produce smoother distributions. Additionally, average shortest path length appears to be the most challenging property for GDSS and DiGress to capture and generalize correctly. From the ECDFs, we conjecture that GDSS and DiGress are not able to concentrate their generations near the middle of the distribution. Given their reliance on diffusion-based mechanisms, this observation could imply that, while these models may perform adequately at higher noise levels, they exhibit diminished precision at lower noise levels.

\begin{table}[t]

%\parbox{0.85\linewidth}{
\centering
\caption{Average $\phi_{KS}$ values for different test properties in Use Cases 2 and 3, where $\phideg$, $\phitriads$, $\phishort$, $\phiclus$, and $\phiclique$ denote the scores for average degree, average number of triads, average shortest path length, average clustering coefficient, and average maximal cliques, respectively.}
\resizebox{\linewidth}{!}{
\label{tab:compare_models}

\begin{tabular}{p{0.1\linewidth}|p{0.16\linewidth}|p{0.1\linewidth}p{0.1\linewidth}p{0.1\linewidth}p{0.1\linewidth}p{0.1\linewidth}}
    
     & Model & $\phideg$ & $\phitriads$ & $\phishort$ & $\phiclus$ & $\phiclique$ \\
     \hline \\[-2ex]
     
     \multirow{3}{*}{\rotatebox[origin=c]{90}{\parbox[c]{1cm}{\centering \footnotesize{Use Case 2}}}}  %& E.Memo & 0.250 & 0.220 & 0.247 & 0.285 & 0.155 \\
     &DiGress & 0.678 & 0.59 & 0.779 & 0.493 & 0.333  \\
     &GDSS & 0.546 & 0.511 & 0.727 & 0.484 & 0.285  \\
     &GGAN & \textbf{0.334} & \textbf{0.108} & \textbf{0.188} & \textbf{0.165} & \textbf{0.057} \\
     \hline \\[-2ex]
     \multirow{3}{*}{\rotatebox[origin=c]{90}{\parbox[c]{1cm}{\centering \footnotesize{Use Case 3}}}} %& E.Memo & 0.313 & 0.281 & 0.259 & 0.312 & 0.170 \\
     &DiGress & 0.688 & 0.601 & 0.783 & 0.437 & 0.294  \\
     &GDSS & 0.516 & 0.47 & 0.709 & 0.492 & 0.293  \\
     &GGAN & \textbf{0.376} & \textbf{0.06} & \textbf{0.155} & \textbf{0.167} & \textbf{0.044}  \\
\end{tabular}
}
\vspace{-1.8em}
\end{table}




\emph{Use Case 3}: The user seeks a model that generalizes well on the edges of the distribution for a particular property. We can compute the average $\phi_{KS}$ of each test property $\ell'$ across all split properties $\ell$ (with $\ell' \neq \ell$) but only for $j = 1$ and $j = 4$, since these particular splits focus on the edges of the distribution, as illustrated in \autoref{fig:scv-illustration}. The results are presented in \autoref{tab:compare_models}; they are consistent with Use Case 2 and reveal distinct strengths and weaknesses in capturing various properties. GGAN consistently exhibits strong performance across most of the properties, while GDSS remains in second place.
DiGress, although performing slightly better than GDSS in modeling average clustering coefficients and similarly to GDSS in modeling maximal cliques, tends to rank lower overall.
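
The three use cases differ only in which scores are averaged. As a sketch, write $\phi_{KS}^{(\ell, j, \ell')}$ for the score obtained on test property $\ell'$ when holding out split $j$ of split property $\ell$ (this superscript notation is introduced here for illustration only), and assume that every split contributes exactly one score per test property. Then
\begin{align*}
\text{Use Case 1:}\quad & \frac{1}{k\,m\,(m-1)} \sum_{\ell=1}^{m} \sum_{j=1}^{k} \sum_{\ell' \neq \ell} \phi_{KS}^{(\ell, j, \ell')}, \\
\text{Use Case 2 (fixed $\ell'$):}\quad & \frac{1}{k\,(m-1)} \sum_{\ell \neq \ell'} \sum_{j=1}^{k} \phi_{KS}^{(\ell, j, \ell')}, \\
\text{Use Case 3 (fixed $\ell'$):}\quad & \frac{1}{2\,(m-1)} \sum_{\ell \neq \ell'} \sum_{j \in \{1, k\}} \phi_{KS}^{(\ell, j, \ell')}.
\end{align*}
With $k = 4$ and $m = 5$, these are averages over $80$, $16$, and $8$ scores, respectively.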

For the Qm9 dataset, the Use Case 1 scores were $0.174$ for DiGress and $0.096$ for GDSS. The results of Use Cases 2 and 3 are presented in \autoref{tab:compare_models_qm9}. Overall, the results follow a similar trajectory to those on the Comm dataset in this setting: the performance of GDSS and DiGress is close, but GDSS achieves a better score overall. It is also worth comparing the results previously introduced in \autoref{sec:vtrain-val-vtest} in \autoref{tab:compare_models_qm9_2} (in particular those under v-val) to the current results of Use Case 2. We see that the overall trend did not change; however, the scores on the Mlwt property improved in the latter, suggesting that the particular split (i.e., the combination of split property $\ell$ and split index $j$) chosen in \autoref{sec:vtrain-val-vtest} was a particularly hard one, as averaging over multiple splits improved the scores. This again emphasizes the role of choosing the split features, and we discuss this point in more detail in \autoref{sec:add_exp} and in \autoref{sec:disscussion}.
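
As a consistency check on these numbers, and assuming each test property is averaged over the same number of splits, the Use Case 1 score should equal the mean of the corresponding Use Case 2 row in \autoref{tab:compare_models_qm9}:
\[
\text{DiGress: } \frac{0.136 + 0.368 + 0.082 + 0.128 + 0.156}{5} = 0.174, \qquad
\text{GDSS: } \frac{0.088 + 0.171 + 0.080 + 0.080 + 0.059}{5} \approx 0.096 .
\]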

\begin{table}[H]

%\parbox{0.85\linewidth}{
\centering
\caption{Average $\phi_{KS}$ values for different test properties for Use Cases 2 and 3, where $\phideg$, $\phimwt$, $\phitpsa$, $\phirc$, and $\philogp$ denote the scores for average degree, molecular weight, TPSA, average ring counts, and LogP, respectively.}

\resizebox{\linewidth}{!}{
\label{tab:compare_models_qm9}

\begin{tabular}{p{0.1\linewidth}|p{0.16\linewidth}|p{0.1\linewidth}p{0.1\linewidth}p{0.1\linewidth}p{0.1\linewidth}p{0.1\linewidth}}
    
     & Model & $\phideg$ & $\phimwt$ & $\phitpsa$ & $\phirc$ & $\philogp$ \\
     \hline \\[-2ex]
     
     \multirow{2}{*}{\rotatebox[origin=c]{90}{\parbox[c]{0.75cm}{\centering \small{Use Case 2}}}}  %& E.Memo &  &  &  &  &  \\
     &DiGress & 0.136 & 0.368 & 0.082 & 0.128 & 0.156 \\
     &GDSS & \textbf{0.088} & \textbf{0.171} & \textbf{0.080} & \textbf{0.080} & \textbf{0.059} \\
     \hline \\[-2ex]
     \multirow{2}{*}{\rotatebox[origin=c]{90}{\parbox[c]{0.75cm}{\centering \small{Use Case 3}}}} %& E.Memo &  &  &  &  &  \\
     &DiGress & 0.113 & 0.347 & 0.067 & 0.102 & 0.154 \\
     &GDSS & \textbf{0.100} & \textbf{0.173} & \textbf{0.067} & \textbf{0.086} & \textbf{0.070} \\
  
\end{tabular}
}
\end{table}


\section{Discussion and Conclusion} 
\label{sec:disscussion}

\paragraph{Generating Molecules on Thin Support Regions} 
In practice, novel molecule generation may focus on generating molecules within the thick support (rather than thin support) of the \emph{marginal} property distributions because those molecules would be most similar to known molecules. However, we argue that evaluating generation on thin support is still important because thin support regions in the \emph{joint} distribution could be hidden in the thick support of \emph{marginal} distributions, especially when considering high-dimensional distributions.
For example, consider samples on a 3D sphere. When the samples are projected onto any of the three dimensions, the support will look dense near zero.
However, the distribution has no support at or near the all zero vector. 
Thus, we hypothesize that in high dimensional spaces, there are many thin support regions that are hidden. 
When we systematically create thin support regions using our approach, the goal is to measure the model's ability to generalize to thin support in general (including thin support of the \emph{joint} distribution). Thus, while in practice novel molecule generation may focus on generating molecules within the thick support of the marginal property distributions, we test the ability of the model to generate in those regions as this will reflect its ability to generate in thin support of the \emph{joint} distribution.
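
The sphere example can be made concrete under the additional assumption (ours, for illustration) that the samples $x = (x_1, x_2, x_3)$ are uniformly distributed on the unit sphere $\|x\|_2 = 1$. By the classical result that the area of a spherical zone depends only on its height, each coordinate is then uniformly distributed on $[-1, 1]$:
\[
p(x_i) = \tfrac{1}{2} \quad \text{for } x_i \in [-1, 1], \qquad \text{while} \qquad \|x - 0\|_2 = 1 \ \text{ for every sample } x .
\]
Every marginal thus places as much density at zero as anywhere else, yet every sample is at distance exactly $1$ from the all-zero vector.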


\paragraph{Limitations} 
Our approach, when used exhaustively as in the experiments of \autoref{sec:scv}, can be computationally burdensome. In practice, however, we would choose only a single split feature and a single split index, as we did in \autoref{sec:vtrain-val-vtest}, which avoids the added computational cost.
While we recommend choosing a split feature that is maximally dependent on other features as we discuss in \autoref{sec:add_exp}, choosing the split property is still an area of potential optimization. Additionally, because our method depends heavily on the joint distribution of the chosen properties, we recommend that the user pre-examine the property distributions and carefully select relevant properties, where properties with smooth distributions will likely be better for evaluation. Also, our method is limited to 1-dimensional properties. Generalizing our method to multivariate splits is an area for future work. Finally, we note that estimating sample weights is complex and while our kernel mean matching (KMM) approach worked reasonably well in our case, choosing the kernel parameters or using more advanced weight estimation approaches is an open area of exploration (more discussion in \autoref{sec:implementation-other}).
Therefore, we hope our work opens up new avenues of research.

\paragraph{Conclusion} In summary, we introduced \fullname, a new framework for biased splitting and reweighting, to evaluate the generalizability of  implicit  graph generative models on thin support regions. We developed a practical algorithm to perform this given a set of graph properties. We demonstrated that this validation approach can be used to select models which will generalize better to thin support regions. Ultimately, we hope that our approach is a step in establishing more concrete and robust evaluation methodologies for  graph generative models. 

\vspace{-0.6em}
\section*{Acknowledgements}
\vspace{-0.6em}
M.E. and D.I. acknowledge support from ARL (W911NF-2020-221).
This work was funded in part by the National Science Foundation (NSF) awards, CCF-1918483, CAREER IIS-1943364 and CNS-2212160, Amazon Research Award, AnalytiXIN, and the Wabash Heartland Innovation Network (WHIN), Ford, NVidia, CISCO, and Amazon. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the sponsors.



