

\section{Measuring Generalization for Implicit Graph Generative Models}

Our proposed vertical validation method (VV) can be viewed as a generalized version of cross validation with two main steps. First, we propose a new way to generate biased train-test splits\footnote{We use the term ``split'' here instead of ``fold'' because ``fold'' may suggest uniform splitting.} that depend on a chosen graph property (e.g., average node degree), which we call the \emph{split property}. 
In particular, our train-test splits thin some regions of the support along the split property. Second, we propose a meta evaluation metric that reweights the generated samples to unbias them and then compares them to the held-out samples using a two-sample metric on other graph properties. This second step is needed because, under our biased train-test splitting, the training distribution differs from the held-out test distribution. The biased splitting and the reweighting are both carefully controlled using only the 1D distribution of the split property.

Under these circumstances, if the model can still produce ``good'' samples in a region that had thin support, we know that the model generalizes well. Conversely, if the model does not capture the underlying smoothness of the true distribution, it may struggle to generate realistic samples in regions with reduced training data support (underfitting), or it may memorize the training data (overfitting). 

\paragraph{Notation}
To describe our method, we begin by introducing notation that we will use for the rest of the paper. Let $p(G)$ denote the true distribution of graphs and let $G \sim p(G)$ be a random variable representing a graph, which includes the graph itself and any node or edge attributes if available. Let $m$ be the number of user-specified graph properties of interest (e.g., average node degree). These properties are defined by deterministic functions of the graph denoted by $h_\ell: \sG \to \sR$, where $\sG$ is the space of all valid graphs and $\ell \in \{1, \cdots , m\}$. 

Let ${\bf{h}}(G): \sG \rightarrow \sR^m$ be the vector function that maps $G$ to its $m$ graph property values, i.e., ${\mathbf{h}}(G) = (h_1(G), h_2(G), \cdots, h_m(G))$.
Furthermore, let $\mathbf{Z} = {\bf{h}}(G)$,
%= (Z_1,Z_2,\cdots, Z_m) = (h_1(G), h_2(G), \cdots, h_m(G))$, 
where the distribution of $\mathbf{Z}$ is the pushforward of the graph distribution under $\mathbf{h}$. Let the true marginal CDFs of each dimension of $\mathbf{Z}$ be denoted by $F_{Z_\ell}(Z_\ell)$. 
%Where the vectorized version of these individual $F_{Z_\ell}(Z_{\ell})$ is $\mathbf{F}(\mathbf{Z})$ $ = (F_{Z_1}(Z_1), F_{Z_2}(Z_2), \cdots, F_{Z_m}(Z_m))$. 
We now define the random vector $\mathbf{U} \in [0,1]^m$, where each element is the corresponding CDF value of $Z_\ell$, i.e., $U_\ell = F_{Z_{\ell}}(Z_\ell) = F_{h_{\ell}(G)}(h_\ell(G)), \forall \ell \in \{1,2,\cdots,m\}$.
%We then introduce a random variable $\mathbf{U} = \mathbf{F}(\mathbf{Z})$. 
%To sum up: 
\setlength{\belowdisplayskip}{0.5pt} \setlength{\belowdisplayshortskip}{0.5pt}
\setlength{\abovedisplayskip}{0.5pt} \setlength{\abovedisplayshortskip}{0.5pt}

\vspace{-0.5em}
\paragraph{Train-Test Split Notation}
Let $(G_i)_{i=1}^n$ be our given dataset, consisting of i.i.d. samples from $p(G)$, where each $G_i$ is a random variable.
For $k$-fold cross validation, we introduce $m$ split variables for each graph, one per graph property, denoted as
%For a pre-determined $k$ to produce biased $k$-fold cross-validation in step 1 and for any $G_i$ with $i \in \{1, 2, \cdots n\}$, we will also introduce split variables 
$S_{i,1}, S_{i,2}, \cdots, S_{i,m} \in \{1,2, \cdots,k\}$, that indicate the test split for the $i$-th graph using the $\ell$-th split property. 
%Consequently, each graph $G_i$ will have $m$ split variables corresponding to $m$ graph properties. 
Let $S_{i,\ell} \sim p(S|G_i,  h_\ell)$, where  $p(S|G_i,  h_\ell)$ denotes the splitting distribution which can depend on the graph $G_i$ and the $\ell$-th graph property. 
%Generally, $p(S_{i,\ell} = s|G_i, h_\ell) \neq p(S_{i,\ell}=s)$, i.e. the distribution of $S_{i,\ell}$ is dependent on graph $G_i$ and its $\ell$-th graph property. 
Moving forward, $i$ indexes the graphs in the sequence $(G_i)_{i=1}^n$, i.e., $i \in \{1, 2, \cdots, n\}$, and $j$ indexes the splits, i.e., $j \in \{1, \cdots , k\}$. 
Given these split variables, the held-out dataset from $(G_i)_{i=1}^n$ for the $j$-th split and $\ell$-th split property is denoted as $\Xtest^{(\ell,j)} = \ldblbrace G_i | \forall i, S_{i,\ell} = j\rdblbrace$, where the double curly braces $\ldblbrace \rdblbrace$ denote a multiset, in which an element can have multiplicity greater than $1$. The corresponding training dataset is denoted $\Xtrain^{(\ell,j)} =  \ldblbrace G_i | \forall i, S_{i,\ell} \neq j\rdblbrace$. %$\Xtrain^{(\ell,j)}$ and $\Xtest^{(\ell,j)}$ can be simply understood as \textit{complementary} to each other. 
Finally, let $\{\Gbar_i^{(\ell,j)}\}_{i=1}^{n_{\ell,j}}$ be $n_{\ell,j}$ i.i.d. samples from $\qGbar$, which denotes the generated graph distribution using a training algorithm $\Omega$ that only has access to $\Xtrain^{(\ell,j)}$ and let the generated dataset be denoted by $\Xgen^{(\ell,j)} = \ldblbrace\Gbar_i^{(\ell,j)} | \forall i\rdblbrace$.


\subsection{Step 1: Shifted Splitting } %via smoothed quantile splits along a graph property
\label{sec:beta-split}

In the first part of our framework, we need to define the distribution $p(S|G_i,h_\ell)$ to create $k$ biased splits for a given graph $G_i$ and $\ell$-th graph property. By biased splits we mean that the split variable depends on the graph, i.e., $P(S_{i,\ell} | G_i) \neq P(S_{i,\ell})$.
For simplicity, we will assume that the conditional distribution is only dependent on the value of the $\ell$-th graph property, i.e., $p(S|G_i,h_\ell) = p(S|Z_{i,\ell})$.
This will enable us to focus on a 1D distribution for splitting and reweighting.
Many choices of $P(S_{i,\ell}|G_i)$ would give biased splits, but we want a splitting method that is both generic and balanced, and hence we impose the two constraints described below.


\emph{For the first constraint}, we want our method to work generically for any arbitrary distribution of $Z_{i,\ell}$.
Thus, instead of using $Z_{i,\ell}$ directly, we only consider the CDF value of $Z_{i,\ell}$, i.e., we assume $p(S_{i,\ell} | G_i, h_\ell)=p(S_{i,\ell} | U_{i,\ell})$, where $U_{i,\ell} = F_{Z_{i,\ell}}(Z_{i,\ell}) = F_{h_{\ell}(G_i)}(h_{\ell}(G_i))$.
This means that the splitting only depends on the rank of the $\ell$-th graph property rather than a specific value and thus it can be generically applied to any graph property.
Essentially this constraint acts to restrict the space of distributions to those that only depend on $U_{i,\ell}$ making it applicable to any property distribution.


\emph{For the second constraint}, 
 we want the splits to have equal sizes in expectation to ensure that the splits are balanced. To achieve this, the marginal of $S$ must be uniform, i.e., we must ensure that $p(S_{i,\ell}) = 1/k$. 
 To satisfy this constraint, we note that we can decompose the conditional distribution via Bayes rule as $p(S_{i,\ell} | U_{i,\ell}) = \frac{p(S_{i,\ell})p(U_{i,\ell} | S_{i,\ell})}{p(U_{i,\ell})}$, where $p(S_{i,\ell}) = 1/k$ to ensure equal splits and $p(U_{i,\ell})$ is the uniform distribution because $U_{i,\ell}$ is the CDF value of $Z_{i,\ell}$.
 Thus, we can choose any distribution for $p(U_{i,\ell} | S_{i,\ell})$ that satisfies the constraint that the marginal is uniform, i.e., $\sum_j p(S_{i,\ell}=j)p(U_{i,\ell} | S_{i,\ell}=j)$ is uniform. In other words, this constraint reduces to finding component distributions whose mixture, with weights equal to $p(S_{i,\ell}=j)$, is a uniform distribution. One such choice of $p(U_{i,\ell} | S_{i,\ell})$ is disjoint uniform splits corresponding to the quantiles of $U_{i,\ell}$, i.e., $p(U_{i,\ell} | S_{i,\ell} = j) = p_{\mathrm{Unif}[\frac{j-1}{k}, \frac{j}{k}]}(U_{i,\ell})$, where $p_{\mathrm{Unif}[a,b]}(U) = \frac{1}{b-a}$ denotes the density of a $\mathrm{Uniform}$ distribution on the interval $[a,b]$. 

 
 
 However, quantile splits have two issues: (1) sharp cutoffs between splits would create unnaturally sharp edges in the training and test distributions, and (2) parts of the distribution would have zero support in training, so generative models would have to extrapolate beyond their training data, something that cannot be easily done with current methods (see \autoref{fig:quantile_like} for an illustration of quantile splits).
To address these issues with the uniform quantile splits, we instead use a different distribution for $p(U_{i,\ell}|S_{i,\ell})$ that satisfies the constraint mentioned above. %: 

\begin{figure*}[htp]
% \resizebox{\textwidth}{.3\textwidth}{
%height=.1\textwidth
\centering
\vspace{-1em}
%\includegraphics[width=\textwidth]{figures/figures_graph}
\begin{subfigure}{0.7\textwidth}
\hspace{-2em}
\includegraphics[page=5,width=\textwidth, trim=0cm 0.5cm 1cm 0.5cm, clip]{figures/figures_graph.pdf}
\caption{}
%\vspace{-1.5em}
\label{fig:scv-illustration}
\end{subfigure}
%\hspace*{\fill}
\begin{subfigure}{0.27\textwidth}
\hspace{-3em}
\includegraphics[width=2.8\textwidth]{figures/plot2.pdf}
\caption{}
%\vspace{-1.5em}
\label{fig:scv-illustration-2}
\end{subfigure}

\vspace{1em}
\caption{(a) The VV splitting process has 5 steps: 1) compute the relevant graph properties for each graph, 2) project samples via the CDF to a uniform distribution, %(see empirical approximation in \autoref{sec:implementation-other}), 
3) define the split distributions via a mixture of Beta distributions and a uniform distribution, 4) compute the split probability conditioned on $U_\ell$ using Bayes rule, and finally 5) sample the split variable based on these conditional probabilities. This results in different splits. In the histograms above, we plot the distribution of the split property $\ell$ in both the train and held-out parts for different splits. (b) An illustration of the reweighting process performed by VV for one of the splits (for $j = 1$ and $\ell = 1$, where the total number of properties is $m = 2$).\label{fig:combined}}

\end{figure*}



%\vspace{-0.5em}
\paragraph{Using Beta Distributions to Create Smoothed Quantile Splits:}
\label{sec:beta_splits}

For the first issue of unnaturally sharp distribution edges, inspired by empirical Beta copula models \citep{segers2017empirical}, we first note that a simple equal-weight mixture of Beta distributions has a uniform marginal distribution, which is exactly the property we need for splitting (i.e., it satisfies $U_{i,\ell} \sim \textnormal{Uniform}[0,1]$).
Specifically, an equal-weight mixture of $k$ Beta distributions with parameters $\alpha_j = j$ and $\beta_j = k + 1 - j$ for $j \in \{1,2,\cdots, k\}$ is uniform \citep{segers2017empirical}, i.e., $\frac{1}{k}\sum_j p_{\mathrm{Beta}[\alpha_j,\beta_j]}(U) = p_{\mathrm{Unif}[0,1]}(U)$. Thus we can choose $p(U_{i,\ell}|S_{i,\ell}=j) = p_{\mathrm{Beta}[\alpha_j,\beta_j]}(U_{i,\ell})$, which leads to $\sum_{j} p(S_{i,\ell} = j)\, p_{\mathrm{Beta}[\alpha_j,\beta_j]}(U_{i,\ell}) = p_{\mathrm{Unif}[0,1]}(U_{i,\ell}) = p(U_{i,\ell})$.  
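
As a small worked check of this property, consider the illustrative case $k = 2$ (a value chosen here only for exposition), so that $(\alpha_1,\beta_1) = (1,2)$ and $(\alpha_2,\beta_2) = (2,1)$. The two Beta densities are $2(1-u)$ and $2u$ on $[0,1]$, and their equal-weight mixture is
\begin{equation*}
\tfrac{1}{2}\, p_{\mathrm{Beta}[1,2]}(u) + \tfrac{1}{2}\, p_{\mathrm{Beta}[2,1]}(u) = (1-u) + u = 1 = p_{\mathrm{Unif}[0,1]}(u).
\end{equation*}
In this example, the first split favors small values of the split property and the second favors large values, while the two together still cover $[0,1]$ uniformly.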

However, this approach still has two potential issues. First, the split distributions overlap more than we would like, especially when $k$ is relatively small. Second, this approach does not guarantee meaningful support in every region for the current split. 
To deal with the first concern, we propose adding a sharpness scale ($\sharpness$) that acts to sharpen the edges of the distribution (i.e., making the splits more vertical). The sharpness scale identifies the number of adjacent Beta distributions to mix together for a single split distribution. 
Specifically, if we use $\sharpness \cdot k$ to be the total number of Beta distributions, then we can let each of the split distributions be mini mixtures of adjacent distributions, i.e., $\pbetamix(U_{i,\ell}|S_{i,\ell} = j ) = \frac{1}{\sharpness}\sum_{a=1}^{\sharpness} p_{\mathrm{Beta}[\alpha_{j,a}, \beta_{j,a}]}(U_{i,\ell})$, where $\alpha_{j,a} = (j - 1)\sharpness + a$ and $\beta_{j,a} = \sharpness k + 1 - \alpha_{j,a}$.
This setting increases the sharpness of the splits, thereby decreasing the overlap between adjacent splits. As $\sharpness$ increases, we combine more Beta distributions together, which leads to more concentrated split distributions with sharper edges. %, which also controls how much will each split overlap with the other. 
In the limit as $\sharpness$ goes to infinity, we recover the quantile splits; thus, this can be seen as a relaxation of quantile-based splitting. For the second issue of zero or near-zero support, our smoothed quantiles alleviate the problem somewhat, but the support may still be near zero in certain regions. 
In that case, the training set may contain few or no samples corresponding to the held-out samples. We address this by mixing our previously defined mixture of Beta distributions with the uniform distribution as follows: 
\begin{equation}
\label{eqn:dist_choice}
p(U_{i,\ell}|S_{i,\ell}) = (1-\epsilon) \pbetamix(U_{i,\ell}|S_{i,\ell}) + \epsilon \cdot p_{\mathrm{Unif}[0,1]}(U_{i,\ell})
\end{equation}
where $\epsilon \in [0,1]$ is the mixing parameter. This addition ensures a minimum representation of each region of the support in every split; if $\epsilon=1$, we get completely random splitting, as in standard CV. We summarize our whole splitting procedure in \autoref{fig:scv-illustration}, and illustrate the effects of different parameters on our splits in \autoref{fig:explain_all}. Additional figures illustrating the Beta-based splits are in \autoref{sec:illustrate_beta_splits}. In \autoref{sec:proofs}, we also prove \autoref{thm:splits}, which states that this choice of $p(U_{i,\ell}|S_{i,\ell})$ yields biased splits.
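
To make the roles of $\sharpness$ and $\epsilon$ concrete, the following worked example uses the illustrative values $k = 2$ and $\sharpness = 2$ (chosen here only for exposition). There are $\sharpness k = 4$ Beta components, $\mathrm{Beta}[1,4]$, $\mathrm{Beta}[2,3]$, $\mathrm{Beta}[3,2]$, and $\mathrm{Beta}[4,1]$, and each split averages two adjacent components:
\begin{align*}
\pbetamix(u \mid S = 1) &= \tfrac{1}{2}\left[4(1-u)^3 + 12u(1-u)^2\right] = 2(1-u)^2(1+2u), \\
\pbetamix(u \mid S = 2) &= \tfrac{1}{2}\left[12u^2(1-u) + 4u^3\right] = 2u^2(3-2u).
\end{align*}
Their equal-weight average is $(1-u)^2(1+2u) + u^2(3-2u) = 1$, so the marginal remains uniform. At $u = 3/4$, the first split has density $2 \cdot \tfrac{1}{16} \cdot \tfrac{5}{2} \approx 0.31$, compared with $2(1 - \tfrac{3}{4}) = 0.5$ for $\sharpness = 1$, which illustrates the reduced overlap. After mixing with the uniform distribution, each split density becomes $(1-\epsilon)\,\pbetamix(u \mid S = j) + \epsilon \geq \epsilon$, so no region of $[0,1]$ is left with zero density in any split.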


\begin{restatable}{proposition}{thmone}
\label{thm:splits}
For any $\epsilon < 1$ and $\sharpness \in \{1,2,\dots\}$ and assuming the splits are equal size in expectation, i.e., $p(S_{i,\ell})=\frac{1}{k}$, if
$p(U_{i,\ell}|S_{i,\ell}) 
    =(1\!-\!\epsilon)\pbetamix(U_{i,\ell}|S_{i,\ell}) + \epsilon p_{\mathrm{Unif}[0,1]}(U_{i,\ell})\,,
$
where
\begin{align*}
    \pbetamix(U_{i,\ell}|S_{i,\ell}\!=\!j) 
    &=\frac{1}{\sharpness}\sum_{a=1}^\sharpness p_{\mathrm{Beta}[\alpha_{j,a},\beta_{j,a}]}(U_{i,\ell}) 
\end{align*}
and where $\alpha_{j,a} \triangleq (j-1) \sharpness + a$ and $\beta_{j,a} \triangleq \sharpness k + 1 - \alpha_{j,a}$,

then $p(U_{i,\ell})=\mathrm{Uniform}[0,1]$ and the splits will be biased, i.e., $p(S_{i,\ell}|G_i) = p(S_{i,\ell}|U_{i,\ell}) \neq p(S_{i,\ell})$ or equivalently $I(S_{i,\ell}, G_i) > 0$.

\end{restatable} 
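
As an illustration of the bias stated in \autoref{thm:splits}, consider the simple case $k = 2$ and $\sharpness = 1$ (illustrative values chosen here only for exposition). Since $p(S = j) = 1/2$ and $p(U)$ is uniform on $[0,1]$, Bayes rule gives
\begin{align*}
p(S = 1 \mid U = u) &= \tfrac{1}{2}\left[(1-\epsilon)\, 2(1-u) + \epsilon\right] = (1-\epsilon)(1-u) + \tfrac{\epsilon}{2}, \\
p(S = 2 \mid U = u) &= \tfrac{1}{2}\left[(1-\epsilon)\, 2u + \epsilon\right] = (1-\epsilon)\,u + \tfrac{\epsilon}{2}.
\end{align*}
The two probabilities sum to one and differ from the marginal $1/2$ whenever $\epsilon < 1$ and $u \neq 1/2$. Both are bounded below by $\epsilon/2$: a graph with the smallest property value ($u = 0$) is held out in split $1$ with probability $1 - \epsilon/2$ and in split $2$ with probability $\epsilon/2$.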

\iffalse
\begin{proof}

In the proof below we suppress the dependency on $i$ and $\ell$ for simplicity.
To prove that $p(U_{i,\ell}) = \mathrm{Uniform}[0,1]$, we first prove the claim for $\pbetamix(U|S)$ by marginalizing the joint distribution of $U$ and $S$ over $S$:

\begin{align}
    & \sum_{j = 1}^k \pbetamix(S = j , U) \\
    & = \sum_{j = 1}^k p(S=j)\, \pbetamix(U|S=j) \\
    & = \sum_{j = 1}^k p(S=j) \frac{1}{\sharpness} \sum_{a = 1}^{\sharpness} p_{\mathrm{Beta}[\alpha_{j,a},\beta_{j,a}]}(U) \\
    & = \frac{1}{\sharpness k} \sum_{j = 1}^k \sum_{a = 1}^{\sharpness} p_{\mathrm{Beta}[\alpha_{j,a},\beta_{j,a}]}(U) \\
    & = \frac{1}{\sharpness k} \sum_{j = 1}^k \sum_{a = 1}^{\sharpness} p_{\mathrm{Beta}[(j-1)\sharpness+a,\, k\sharpness + 1 - ((j-1)\sharpness + a)]}(U) \\
    \intertext{Let $n = k \sharpness$ and $r = (j-1)\sharpness + a$. As $j$ ranges over $\{1,\dots,k\}$ and $a$ over $\{1,\dots,\sharpness\}$, $r$ takes each value in $\{1,\dots,n\}$ exactly once, so we can rewrite the double sum as:}
    & = \frac{1}{n} \sum_{r = 1}^{n} p_{\mathrm{Beta}[r,\, n + 1 - r]}(U) \\
    & = p_{\mathrm{Unif}[0,1]}(U)
\end{align}

The penultimate expression matches the result in Section 2.1 of \cite{segers2017empirical} with $d = 1$ and $n = k \sharpness$, which gives the last equality.

Next, we show that the same holds for the full choice of $p(U_{i,\ell}|S_{i,\ell})$, which we denote by $p_{\epsilon,\sharpness}$:
\begin{align}
&\sum_{j=1}^k p_{\epsilon,\sharpness}(S_{i,\ell}=j,U_{i,\ell}) \\
&= \sum_{j = 1}^k p(S_{i,\ell}=j)\, p_{\epsilon,\sharpness}(U_{i,\ell}|S_{i,\ell} = j) \\
&= \sum_{j = 1}^k p(S_{i,\ell}=j) [(1\!-\!\epsilon)\pbetamix(U_{i,\ell}|S_{i,\ell}=j) + \epsilon p_{\mathrm{Unif}[0,1]}(U_{i,\ell})]\\
&= \epsilon  p_{\mathrm{Unif}[0,1]}(U_{i,\ell}) + \frac{(1-\epsilon)}{k} \sum_{j = 1}^k \pbetamix(U_{i,\ell}|S_{i,\ell}=j)\\
&= \epsilon  p_{\mathrm{Unif}[0,1]}(U_{i,\ell}) + (1-\epsilon) p_{\mathrm{Unif}[0,1]}(U_{i,\ell})\\
&= p_{\mathrm{Unif}[0,1]}(U_{i,\ell})\,,
\end{align}
where the second-to-last equality invokes the result of the first derivation.


To prove that the splits will be biased, we will use mutual information as follows:
Let $p(U,S)$ denote the joint distribution of $p(U_{i,\ell},S_{i,\ell})$ where we suppress the dependence on $i$ and $\ell$ for simplicity.
The mutual information can be written as the following:
\begin{align}
    I(U,S) 
    &\equiv \mathrm{KL}(p(U,S), p(U)p(S)) \\
    &= \E_{p(S)}[\E_{p(U|S)}[\log \frac{p(U,S)}{p(U)p(S)}]] \\
    &= \E_{p(S)}[\E_{p(U|S)}[\log \frac{p(U|S)p(S)}{p(U)p(S)}]] \\
    &= \E_{p(S)}[\E_{p(U|S)}[\log \frac{p(U|S)}{p(U)}]] \\
    &= \E_{p(S)}[\mathrm{KL}(p(U|S), p(U))] \\
    &>0 \,,
\end{align}
where the last inequality is because $p(U|S=j) \neq p(U), \forall j$ and thus the KL must be positive for all terms in the expectation.
Additionally, if $\epsilon=0$, since $p(U) = p_{\mathrm{Unif}[0,1]}$, we also have that the KL terms are equal to negative differential entropy of the Beta distributions which is known in closed form, i.e.,
\begin{equation}
    \begin{aligned}[b]
    &\mathrm{KL}(p_{\mathrm{Beta}[\alpha,\beta]}, p_{\mathrm{Unif}[0,1]}) \equiv -H(p_{\mathrm{Beta}[\alpha,\beta]}) \\
    &=-[\log \mathrm{B}(\alpha,\beta) - (\alpha-1)\gamma(\alpha)- (\beta-1)\gamma(\beta) + \\
    &(\alpha+\beta-2)\gamma(\alpha + \beta)] \,,
\end{aligned}
\end{equation}

where $\mathrm{B}(\cdot,\cdot)$ denotes the Beta function and $\gamma(\cdot)$ denotes the digamma function.

Furthermore, if $0\leq \epsilon \leq 1$ and we consider the term $\mathrm{KL}((1-\epsilon)p_{\mathrm{Beta}[\alpha,\beta]} + \epsilon p_{\mathrm{Unif}[0,1]}, p_{\mathrm{Unif}[0,1]})$ which we will refer to as $\mathrm{KL}_{\textnormal{full terms}}$, we know that at $\epsilon = 1$ the term becomes $\mathrm{KL}( p_{\mathrm{Unif}[0,1]}, p_{\mathrm{Unif}[0,1]}) = 0$, and on the other extreme at $\epsilon = 0$, it becomes $\mathrm{KL}(p_{\mathrm{Beta}[\alpha,\beta]}, p_{\mathrm{Unif}[0,1]})$ which is known in closed form by the result above. Thus for any $0 < \epsilon < 1$, we are guaranteed to have: $0 < \mathrm{KL}_{\textnormal{full terms}} < -H(p_{\mathrm{Beta}[\alpha,\beta]})$.
Thus, it is possible to explicitly compute the mutual information and adjust $\sharpness$ or $\epsilon$ to match a target mutual information upperbound.


Furthermore, we can use the convexity of the KL divergence to bound the term for the full $p(U|S)$ as follows:
\begin{align}
 & \mathrm{KL}((1-\epsilon)p_{\mathrm{Beta}[\alpha,\beta]} + \epsilon p_{\mathrm{Unif}[0,1]}, p_{\mathrm{Unif}[0,1]})\\
 & = \mathrm{KL}((1-\epsilon)p_{\mathrm{Beta}[\alpha,\beta]} + \epsilon p_{\mathrm{Unif}[0,1]}, (1 - \epsilon) p_{\mathrm{Unif}[0,1]} +  \epsilon p_{\mathrm{Unif}[0,1]})\\
 & \leq (1-\epsilon) \mathrm{KL}(p_{\mathrm{Beta}[\alpha,\beta]},p_{\mathrm{Unif}[0,1]}) + \epsilon \mathrm{KL}(p_{\mathrm{Unif}[0,1]},p_{\mathrm{Unif}[0,1]})\\
 & = - (1-\epsilon) H(p_{\mathrm{Beta}[\alpha,\beta]})
\end{align}

A larger mutual information corresponds to a stronger dependence of the split variable on the graph property, i.e., to more biased splits; at $\epsilon = 1$ the mutual information is zero and the splits are unbiased.

\end{proof}
\fi




\begin{figure*}[ht]
%\vspace{-0.8em}
\centering
  \begin{subfigure}{0.18\textwidth}
    %\includegraphics[width=\linewidth]{figures/sharp_1_eps_0.PNG}
    \includegraphics[width=\linewidth]{figures/stacked_1_0.png}
    \caption{$\sharpness = 1, \epsilon = 0$} \label{fig:justbeta}
  \end{subfigure}%
  \hspace*{\fill}   % maximize separation between the subfigures
  \begin{subfigure}{0.18\textwidth}
    %\includegraphics[width=\linewidth]{figures/sharp_10_eps_0.PNG}
    \includegraphics[width=\linewidth]{figures/stacked_10_0.png}
    \caption{$\sharpness = 10, \epsilon = 0$} \label{fig:beta_sharp}
  \end{subfigure}%
  \hspace*{\fill}   % maximizeseparation between the subfigures
  \begin{subfigure}{0.18\textwidth}
    %\includegraphics[width=\linewidth]{figures/sharp_10_eps_01.PNG}
    \includegraphics[width=\linewidth]{figures/stacked_10_01.png}
    \caption{$\sharpness = 10, \epsilon = 0.1$} \label{fig:beta_sharp_eps}
  \end{subfigure}
\hspace*{\fill}
   \begin{subfigure}{0.18\textwidth}
    %\includegraphics[width=\linewidth]{figures/sharp_1000_eps_0.PNG}
    \includegraphics[width=\linewidth]{figures/stacked_1000_0.png}
    \caption{$\sharpness = 1000, \epsilon = 0$}\label{fig:quantile_like}
  \end{subfigure}
\hspace*{\fill}
   \begin{subfigure}{0.18\textwidth}
    %\includegraphics[width=\linewidth]{figures/sharp_1_eps_1.PNG} 
    \includegraphics[width=\linewidth]{figures/stacked_1_1.png}
    \caption{$\sharpness = 1, \epsilon = 1$} \label{fig:cv_like}
  \end{subfigure}
  %\vspace{1.em}
  \caption{
  The figure shows stacked conditional split probabilities $p(S_{\ell}|U_{\ell})$ obtained using \name with different values of the sharpness $\sharpness$ and the uniform mixing parameter $\epsilon$.
  In (a), the split distributions may overlap too much when $\sharpness=1$.
  In (b), the splits are sharper but still smooth.
  In (c), the uniform mixing parameter allows all splits to have some support.
  In (d) and (e), we demonstrate the extremes of our approach that yield either quantile splits as $\sharpness \to \infty$ or uniform splits as in standard CV if $\epsilon = 1$.
  }
  \label{fig:explain_all}

  \end{figure*}
  



\subsection{Step 2: Defining a Meta-metric to Adjust For Shifted Splits} \label{sec:weights}

The second step in our approach is reweighting the samples generated by our model to account for the model having been trained on a biased dataset. The generative model ($\Omega$) was trained on $\Xtrain^{(\ell,j)}$ to produce samples $\Xgen^{(\ell,j)}$ (whose true distribution is $\qGbar$). Those generated samples (or, more precisely, their properties, which we denote by $\bar{Z}$) will follow a distribution similar to that of the training dataset (assuming that the model does not underfit), but different from the distribution of the held-out samples $\Xtest^{(\ell,j)}$. We refer to the true distributions of a property $\ell$ in $\Xtrain^{(\ell,j)}$ and $\Xtest^{(\ell,j)}$ as $p(Z_{i,\ell}|S_{i,\ell} \neq j) $ and $p(Z_{i,\ell}|S_{i,\ell} = j)$ respectively, and to that of the generated samples as $q(\bar{Z}_{i,\ell}|\theta=\theta_{\Omega(\Xtrain^{(\ell,j)})}^*)$.

To account for this shift in distribution, we need to unbias the generated samples. One way of doing so is to reweight them. We thus define the importance weight of each sample $\Gbar_i$ in $\Xgen^{(\ell,j)}$ with property value $\bar{Z}_{i,\ell} = h_\ell(\Gbar_i)$ as follows:
\begin{equation}
\label{eqn:wts}
 W^{(\ell,j)}(\Gbar_i) :=  \frac{p(\bar{Z}_{i,\ell}|S_{i,\ell}=j)}{q(\bar{Z}_{i,\ell}|\theta=\theta_{\Omega(\Xtrain^{(\ell,j)})}^*)}
\end{equation}
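
To illustrate how these importance weights behave, consider a worked example under two simplifying assumptions that are made here only for exposition and are not required by the method: the generator reproduces the training distribution of the split property exactly, and the weight is expressed in terms of $u = F_{Z_\ell}(z)$ (for a strictly increasing CDF, the density ratio is unchanged by this change of variables). With $k = 2$, $\sharpness = 1$, and $j = 1$, the held-out density is $(1-\epsilon)\,2(1-u) + \epsilon$ and the training density is $(1-\epsilon)\,2u + \epsilon$, so that
\begin{equation*}
W(u) = \frac{2(1-\epsilon)(1-u) + \epsilon}{2(1-\epsilon)\,u + \epsilon}.
\end{equation*}
For $\epsilon = 0$ this reduces to $(1-u)/u$, which is unbounded as $u \to 0$. For $\epsilon > 0$ the weight is decreasing in $u$ and bounded above by its value at $u = 0$, namely $(2-\epsilon)/\epsilon$ (e.g., $19$ for $\epsilon = 0.1$), which suggests that the uniform mixing also keeps the weights under control.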

Using the definition above, we can extend our graph multisets to a weighted version $\mathcal{G}_W = \ldblbrace (G_i, W(G_i)): G_i \in \mathcal{G} \rdblbrace$, so that $\Xgen$ has a corresponding $\mathcal{G}_{\sgen,W}$. Furthermore, letting $\phi$ be a metric that compares two distributions and can handle weighted samples, we define our meta-metric as:


\begin{align}
    \phi_{\textnormal{\name}}(\Xheldw, \mathcal{G}_{\sgen,W}^{(\ell, j)}; \phi) = \phi(\Xheldw,\mathcal{G}_{\sgen,W}^{(\ell,j)})
\end{align}

For two-sample tests, we will use an estimate of the number of effective samples based on \cite[Sec. 12.4]{monahan2011numerical} defined as:
\begin{equation}
    N_{\text{eff}}(\mathcal{G}_{\sgen,W}^{(\ell, j)})= \frac{\left(\sum_{(\Gbar_i, W(\Gbar_i)) \in \mathcal{G}_{\sgen,W}^{(\ell, j)}} W^{(\ell,j)}(\Gbar_i)\right)^2}{\sum_{(\Gbar_i, W(\Gbar_i)) \in \mathcal{G}_{\sgen,W}^{(\ell, j)}} W^{(\ell,j)}(\Gbar_i)^2}.
    \label{eqn:neff}
\end{equation}
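
For intuition, consider four generated samples with illustrative weights (values chosen here only for exposition). If all weights equal $1$, then $N_{\text{eff}} = 4^2/4 = 4$, i.e., every sample counts fully. If instead the weights are $(2, 1, 1, 0)$, then
\begin{equation*}
N_{\text{eff}} = \frac{(2+1+1+0)^2}{2^2 + 1^2 + 1^2 + 0^2} = \frac{16}{6} \approx 2.67,
\end{equation*}
so uneven weights reduce the number of effective samples below the nominal sample count.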




\subsubsection{Concrete Instantiation of the Metric via KS Statistics}
As one instantiation of our generic framework, we can choose $\phi$ to be a weighted version of the two-sample KS statistic. One reason for choosing the KS statistic is that its values are always between 0 and 1, and its weighted version can handle weighted samples from two distributions. Given any two weighted graph datasets $\cG_{1,W_1}$ and $\cG_{2,W_2}$, the empirical weighted KS statistic of a specific graph property function $h(\cdot)$ with $Z_{i} = h(G_i)$ is defined as follows:


\begin{align}
%\begin{aligned}
    &\phi_{\text{KS}}(\cG_{1,W_1}, \cG_{2,W_2}; h) = \sup_{G_i} |\hat{F}_{W1}(h(G_i)) - \hat{F}_{W2}(h(G_i)) | \notag \\
    &\quad= \sup_{Z_{i}} |\hat{F}_{W1}(Z_{i}) - \hat{F}_{W2}(Z_{i})|,
\label{eq:phi_KS}
%\end{aligned}
\end{align}
where $\hat{F}_{W1}(z) = \frac{1}{\sum_i W_i} \sum_i W_i {\bf 1}_{(Z_{i} \leq z)}$ is the weighted empirical CDF associated with $\cG_{1,W_1}$ and similarly for $\hat{F}_{W2}(z)$ associated with $\cG_{2,W_2}$. 
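
As a small worked example with illustrative numbers, suppose both datasets contain two graphs with property values $Z \in \{1, 2\}$, where the first dataset has weights $(1, 1)$ and the second has weights $(3, 1)$. Then
\begin{align*}
\hat{F}_{W1}(1) &= \tfrac{1}{2}, & \hat{F}_{W2}(1) &= \tfrac{3}{4}, & \hat{F}_{W1}(2) &= \hat{F}_{W2}(2) = 1,
\end{align*}
so the weighted KS statistic is $\max\{|\tfrac{1}{2} - \tfrac{3}{4}|,\, |1 - 1|\} = \tfrac{1}{4}$.
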
For our specific case, we want to compute this metric between our unweighted $\Xheldw$ dataset (equivalently, weighted by ones) and the reweighted generated samples $\cG_{\sgen,W}^{(\ell,j)}$, using the graph property function $h_{\ell'}$ on which we would like to evaluate. We therefore have:
\begin{equation}
    \phi_{\textnormal{\name}}(\Xtest^{(\ell, j)}, \Xgen^{(\ell, j)};  \phi_{\text{KS}}, h_{\ell'}) = \phi_{\text{KS}}(\Xheldw,\cG_{\sgen,W}^{(\ell,j)};h_{\ell'}),
    \label{eqn:phi_ks}
\end{equation}
where $\ell' \in \{1,\dots,m\}$ and $\ell' \neq \ell$. 
For an illustration of our unbiasing and reweighting procedure, refer to \autoref{fig:scv-illustration-2} (a full version of the figure is presented in \autoref{sec:AppB}). We also prove that this chosen metric converges to zero if the model generalizes well, with a rate of convergence that depends on the number of samples in the dataset.  
Below we state the theorem, while the full proof is presented in \Cref{sec:proofs}. %\autoref{proofs}.

\begin{restatable}[$\phi_{\text{KS}}(\Xheldw, \cG_{\textnormal{gen},W}^{(\ell,j)}; h_{\ell'})$ consistent]{theorem}{thmtwo}
\label{thm:thmone}
Suppose \name is used to generate data splits with corresponding datasets $\Xtrain^{(\ell,j)}$ and $\Xtest^{(\ell,j)}$, and an implicit generator $\Omega$ trained on $\Xtrain^{(\ell,j)}$ is used to generate data $\Xgen^{(\ell,j)}$.

Then, if $\Xgen^{(\ell, j)}$ is generated with the same distribution as $\Xtest^{(\ell, j)}$,  
for any $\epsilon \in [0,1]$ (here $\epsilon$ denotes a deviation threshold rather than the mixing parameter in \autoref{eqn:dist_choice}),

\begin{equation}
    \begin{aligned}[b]
         P(&\phi_{\text{KS}}(\Xheldw, 
     \cG_{\textnormal{gen},W}^{(\ell,j)}; h_{\ell'})  >  \epsilon)  \leq \\ &4 \exp\left(- 2  \min(|\Xgen^{(\ell, j)}| , |\Xtest^{(\ell, j)}|) \left(\frac{\epsilon}{2}\right)^2  \right).
        \end{aligned}
\end{equation}

\end{restatable} 
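
To give a sense of scale for this bound, the following evaluation uses illustrative numbers that are not taken from our experiments. With $\min(|\Xgen^{(\ell, j)}|, |\Xtest^{(\ell, j)}|) = 1000$ and a threshold of $0.1$, the right-hand side is
\begin{equation*}
4 \exp\left(-2 \cdot 1000 \cdot (0.05)^2\right) = 4\exp(-5) \approx 0.027,
\end{equation*}
whereas with only $100$ samples it is $4\exp(-0.5) \approx 2.43$, which exceeds $1$ and is therefore uninformative. The bound thus becomes meaningful only once both sample sets are reasonably large relative to $1/\epsilon^2$.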



\paragraph{Implementation of \fullname}
The comprehensive implementation details are presented in \autoref{sec:implementation-other}, but we summarize a few key points here. First, we use a smoothed version of the empirical CDF instead of the true CDF of graph properties. Second, we use Kernel Mean Matching (KMM) \citep{Huang2006CorrectingSS} to estimate the importance weights. Lastly, to sample from the generative model, we iteratively generate samples until the number of effective samples reaches a predetermined fixed number.


