\onecolumn
\section{Review Response}

\textbf{R1}
\begin{enumerate}
    \item Presentation: It's not very obvious to me how the two desired qualities of the splitting method could lead to biased splits. Could you elaborate on this?
    \item Is there a theoretical guarantee that the splits created using the method presented in Section 3.1 will be biased as wished?
    \item In the image generation domain, are there any evaluation methods/metrics which have considered images in thin support regions? Would it be possible to adapt those methods/metrics to graph generative model evaluation?
    \item The paper is motivated by using graph generative models to design or discover new molecules for medicine or material design. It would be useful to demonstrate in the experiments by showing some examples of new molecules generated or discovered by graph generative models which may have potential for medicine or material design.
\end{enumerate}


\paragraph{Answer:}
We thank the reviewer for their comments and valuable feedback, as well as for the time they spent reviewing our work.
\begin{enumerate}
    
    \item Thank you for the insightful question! The key objective of our splitting method is to create biased splits, meaning that the split variable depends on the graph, i.e., $p(S_{i,\ell} | G_i) \neq p(S_{i,\ell})$. While many choices of $p(S_{i,\ell}|G_i)$ would give biased splits, we wanted a splitting method that is both generic and balanced. Thus, the two qualities specify the constraints of our splitting method rather than its objective. We apologize for the confusion and will clarify this in the paper.
    Concretely, the first quality restricts the space of distributions to those that depend only on $U_{i,\ell}$, which is a function of $G_i$. Because $U_{i,\ell}$ encodes only the normalized rank information of the split property, the method can be applied generically to any property distribution. The second quality ensures that the splits are balanced, which is desirable for most train-test splitting techniques. In our case,
    this is more challenging to enforce. However, because $U_{i,\ell}$ is uniformly distributed by definition of the CDF of $Z_{\ell}$, Bayes' rule lets us simplify the constraint to finding component distributions $p(U_{i,\ell}|S_{i,\ell})$ whose mixture, with weights $p(S_{i,\ell})$, is the uniform distribution.
    This leads us, later in the section, to use specific Beta distributions as components because their mixture is uniform (and thus leads to balanced splits).
    Assuming that the mixture components are not all equal, it is simple to show that $p(S_{i,\ell} | G_i) =  p(S_{i,\ell} | U_{i,\ell}) = \frac{p(S_{i,\ell})p(U_{i,\ell}|S_{i,\ell})}{p(U_{i,\ell})}  \neq p(S_{i,\ell})$ because $p(U_{i,\ell}|S_{i,\ell}) \neq p(U_{i,\ell})$ except in the special case where $p(U_{i,\ell}|S_{i,\ell})$ is uniform for every split (which corresponds to standard random splitting).
    Therefore, with our final choice $\tilde{p}(U_{i,\ell}|S_{i,\ell}) = (1-\epsilon)\, p_{\mathrm{Beta}}(U_{i,\ell}|S_{i,\ell}) + \epsilon\, p_{\mathrm{Unif}[0,1]}(U_{i,\ell})$, where $p_{\mathrm{Beta}}$ denotes the Beta component, our splitting distributions will be biased as long as $\epsilon < 1$. We will add a discussion of this point to the final paper if accepted.
    \item 
    
    We provide a rough sketch of a theoretical guarantee that our method produces biased splits. Building on our answer to your first question, the first quality is satisfied by construction since our splits depend only on $U_{i,\ell}$ (which is a function of $G_i$). The second quality holds because the mixture of the component distributions is uniform, as explained in our answer to your first question. Finally, our final choice $\tilde{p}(U_{i,\ell}|S_{i,\ell}) = (1-\epsilon)\, p_{\mathrm{Beta}}(U_{i,\ell}|S_{i,\ell}) + \epsilon\, p_{\mathrm{Unif}[0,1]}(U_{i,\ell})$, where $p_{\mathrm{Beta}}$ denotes the Beta component, allows us to control the amount of bias through the hyperparameter $\epsilon$: any $\epsilon < 1$ yields biased splits. We hope this helps clarify the theory, and we will include more formal proofs in the final version of the paper if accepted.%Referring to \autoref{fig:explain_all} in the main paper, if we set $\sharpness = 1$ and $\eps = 1$, then our splits become close to the random non-biased splitting scheme.
    \item That is an interesting question. We are not aware of any such metrics but would be happy to hear of any. The general idea of adapting metrics from the image generation domain to the graph generation domain has been previously explored. For example, Preuer et al. [1] introduced the FCD metric that draws motivation from the FID metric introduced by Heusel et al. [2] in the image generation domain. However, the FCD metric is designed to assess the quality of the molecular data generated by a generative model and isn't a general purpose metric for assessing the quality of all graph structures nor is it designed to evaluate generalizability on thin support. If there exists a metric for assessing the generalization qualities of an image generative model, it could (in theory) be adapted to the graph generation domain. However, we note that image generation is much more developed than graph generation and some metrics may not be easy (or even possible) to adapt. % due to the discrete nature of graphs.
    \item 
    %\david{I think we should just start by saying something like: "In [URL], we visualize the highest weighted novel molecules from GDSS and DiGress. It is ...[discuss the results in words here (but copy the words into the PDF as well)].}
    In [URL]\url{} we visualize the four highest-weighted novel molecules generated by GDSS and DiGress in two training scenarios. In both scenarios the split property is average degree; in the first scenario the held-out split is the first one, while in the second it is the last one. We also show where these four molecules lie relative to the train and held-out distributions of the average degree, for both models and both scenarios. Finally, we show that among the 100 highest-weighted generated molecules, several exactly match molecules in the held-out data (exact numbers in the link), which suggests that these are viable novel molecules.
    While we hope this helps contextualize our work, assessing whether a generated molecule has potential for medicine or material design is beyond the scope of this paper and the authors' expertise.
    %The goal of this paper was to introduce a metric for the graph generation domain that can be used when comparing whether model A has more potential at generalizing to thin support than model B. To exemplify this, we include some visualization for novel molecules obtained from DiGress and GDSS (trained using average degree as the split property and split 1 as the held out split) in the link below: \url{https://docs.google.com/document/d/e/2PACX-1vQ-EZwCGr3muqj5zD4C5fN0dcqJFixAN0RPCnMI2b0hX-poR7nsf38Z4RWOi4bXc63KY6FxwedF_JmU/pub}
%https://docs.google.com/document/d/1UoCmZKhHtKotcADYFafR3Ulpy3LgEBs_nwnWx1AV8vQ/edit?usp=sharing}. %MAI: will create an annon google account and copy over the doc
\end{enumerate}
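As a minimal illustration of the argument in items 1 and 2, consider a simplified setting whose parameters are chosen only for exposition and are not necessarily those used in the paper: two splits with equal weights $p(S=1) = p(S=2) = \frac{1}{2}$, components $\mathrm{Beta}(1,2)$ and $\mathrm{Beta}(2,1)$, and $u \in [0,1]$ a value of $U_{i,\ell}$. The component densities and their mixture are
\[
p(u \mid S=1) = 2(1-u), \qquad p(u \mid S=2) = 2u, \qquad \frac{1}{2}\cdot 2(1-u) + \frac{1}{2}\cdot 2u = 1,
\]
so the mixture is uniform and the splits are balanced. Smoothing each component with weight $\epsilon$ on the uniform density keeps the mixture uniform, and Bayes' rule gives
\[
p(S=1 \mid u) = \frac{1}{2}\Bigl[(1-\epsilon)\,2(1-u) + \epsilon\Bigr] = \frac{1}{2} + (1-\epsilon)\Bigl(\frac{1}{2} - u\Bigr).
\]
This equals $p(S=1) = \frac{1}{2}$ for every $u$ only when $\epsilon = 1$; for any $\epsilon < 1$ the split assignment depends on $u$, and hence on the graph, which is the bias we intend.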

[1] K. Preuer, P. Renz, T. Unterthiner, S. Hochreiter, and G. Klambauer. Fréchet ChemNet Distance: A metric for generative models for molecules in drug discovery. Journal of Chemical Information and Modeling, 2018.\newline
[2] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. NeurIPS 2017.\\

\textbf{R2}
\begin{enumerate}
    \item Another round of proof-reading would be beneficial (e.g. lowercase ks, diacritical signs).
    \item "Thick" and "thin" terms are a bit misleading from a mathematical point of view as they suggest the difference in dimensionality of the data manifold. Additionally, these terms should be properly defined in the first paragraphs. 
    \item The motivation part is somewhat vague, because the search for new molecules actually often targets "thick" regions of the data for practical reasons, and it is essentially a fast approximation of a combinatorial search on a set of pre-defined molecular fragments. When talking about the discovery of molecules with drastically different structure, this search is inherently out of distribution and not directly related to the proposed work. This motivation also contrasts with the experiments a little - discussion for Qm9 is in the appendix and does not feature molecules. As mentioned above, actual examples of generated graphs would improve the readability of the discussion part. 
    \item The overall technique appears to be more suitable for dropout-like training, wrapping it as evaluation method is somewhat awkward. Maybe some discussion around it would make the paper more organic. 
    \item A provided code could really help with trying out this method in practice.
    

\end{enumerate}

\paragraph{Answer:}
We thank the reviewer for their comments and valuable feedback, as well as for the time they spent reviewing our work.
\begin{enumerate}
    \item We agree. We are performing another round of proofreading and will fix these typos in the final version.
    \item Thank you for pointing that out. We will work to more carefully define and explain those terms early on. % We are open to considering other terms that might be less confusing (for example: dense vs sparse) and will be sure to introduce those terms early on.
    \item Thank you for your insight. We agree that novel molecule generation tasks may focus on graphs with similar structures, i.e., parts of the distribution where the marginal graph properties are similar to those of known molecules. However, ``thin'' support regions in the joint distribution can be hidden within the ``thick'' support of the marginal distributions, especially in high dimensions. For example, consider samples on a sphere in 3D. When projected onto almost any direction, the support looks dense near zero, yet the distribution has no support at or near the all-zero vector. We therefore hypothesize that high-dimensional spaces contain many hidden thin support regions. By systematically creating thin support regions with our approach, we aim to measure the model's ability to generalize to thin support in general (including hidden thin support). Thus, while generating in thin support may be unrealistic for some properties, we test the model's ability to generate in those regions because it reflects its ability to generate in hidden thin support.

    We also note that we chose molecule generation as the example in the introduction because it is a commonly used graph generation task that lets us explain the general goals of our approach. However, the scope of our approach extends beyond molecules: it is intended for evaluation (or model selection) with respect to generalization on thin support for any graph generative model, regardless of the type of graphs generated (molecules, social network graphs, etc.).
    
    Finally, we agree on the importance of providing visual examples in the final version of the paper, and we provide some examples here: [URL].
    We briefly described the experiments shown in the link in our answer to another reviewer's question, and we copy that description here for completeness. [Copied answer begins here] We visualize the four highest-weighted novel molecules generated by GDSS and DiGress in two training scenarios. In both scenarios the split property is average degree; in the first scenario the held-out split is the first one, while in the second it is the last one. We also show where these four molecules lie relative to the train and held-out distributions of the average degree, for both models and both scenarios. Finally, we show that among the 100 highest-weighted generated molecules, several exactly match molecules in the held-out data (exact numbers in the link), which suggests that these are viable novel molecules.
    
    
    \item Thank you for raising this point. We are unsure what ``dropout-like training'' refers to and would appreciate it if you could clarify. In the meantime, we briefly outline our approach and how it relates to evaluation. % Thank you, we will add a discussion on this topic. Briefly, 
    Step 1 of our approach creates biased train-held splits, where a portion of the samples is assigned to each split. This step is similar to cross-validation (CV), with one difference: CV assigns samples to splits uniformly at random, whereas we assign samples to splits based on a computed property of each sample.
    %akin to dropout-like training in the sense that there is a part of the data that is dropped out, the way of "dropping" though is different, as our approach will "drop" part of the data based on the value of a computed property and not uniformly at random. 
    Step 2 then systematically evaluates the model on its generated samples in a way that accounts for the bias we intentionally introduced in step 1.
    We hope this answer provides some clarification, and we will be sure to include more details in the main paper.
    \item The code for our approach is provided in the uploaded supplementary material, in the file ``propertysplit2.py''. We also provide two notebooks that were used to generate the results in Section 4.1, which validate the effectiveness of VV. These notebooks also serve as examples of how to use the code in ``propertysplit2.py'' with any generative model. The graph generative models we used are all open source and available on GitHub. We did, however, modify some aspects of their data loading (and other minor details) so that they can easily use our computed splits. We will publish all our modifications in a public repository should our paper be accepted.
\end{enumerate}
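The sphere example in item 3 can be made precise under one simplifying assumption, made here only for illustration: that the samples $X = (X_1, X_2, X_3)$ are uniform on the unit sphere in three dimensions. By Archimedes' hat-box theorem, every coordinate (and, by rotational symmetry, the projection onto any unit direction) is uniform on $[-1,1]$:
\[
p_{X_j}(t) = \frac{1}{2} \quad \mbox{for } t \in [-1,1], \qquad \mbox{while} \qquad \|X\|_2 = 1 \ \mbox{ with probability one.}
\]
Each one-dimensional marginal therefore has constant, nonzero density at zero, yet the open unit ball around the all-zero vector contains no samples. A region that looks well covered in every marginal can thus carry no mass in the joint distribution.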

\textbf{R3}
If the objective is to generate molecules with two optimized properties, and these two molecular properties are intrinsically correlated, possibly by an unknown complex nonlinear mapping, is the proposed splitting approach still viable?

\paragraph{Answer:}
We thank the reviewer for their insightful question!
In short, yes, the approach would still be viable.
In fact, our evaluation method is most effective when there is high dependency between graph properties because then our systematic shift along the split property will (implicitly) create marginal distribution shifts along the test properties that will be detectable via the KS statistic.
Please see the example in Figure 1 for data that lies on a circle. In this case, the two properties are intrinsically correlated via a complex non-linear mapping. The shifted splits along the $x$-axis in Figure 1 yield different test marginal distributions along the $y$-axis. Therefore, highly dependent properties are not only compatible with our evaluation method but can in fact be helpful for it. We will be sure to add this discussion to the final paper.
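
As a simplified illustration, assume that the data lie exactly on the unit circle, so each point is $(x, y) = (\cos\theta, \sin\theta)$, and that the held-out split contains the points with $x \geq c$ for a threshold $0 < c < 1$. Both the unit radius and the hard threshold are assumptions made for exposition; our actual splits are soft. Then every held-out point satisfies
\[
|y| = \sqrt{1 - x^2} \leq \sqrt{1 - c^2},
\]
so for $c = 0.8$ the held-out values of $y$ are confined to $[-0.6, 0.6]$, while the remaining points with $x < 0.8$ still cover the full range $y \in [-1, 1]$. A shift along $x$ alone therefore induces a different marginal distribution along $y$, which is the kind of difference the KS statistic is designed to detect.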
%On the other side, completely independent graph properties would mean that our systematic shift along the split property would not create any shift along other test properties and thus may be more difficult to detect.
%In fact, the lack of any correlation between the selected properties might be challenging for graph generation methods. If property 1 and 2 are completely independent, then splitting by conditioning on the value of property 1, might not result in the desired effect on property 2. The desired effect being that we want to create regions where certain values of property 2 are scarcely represented. That's an interesting point that we will include that in the discussion of the paper.

\textbf{R4} The proposed evaluation method is only applied to representative generative models. Can the authors apply this evaluation method to the latest (2023, 2024) generation models? Please show some experimental results if possible.

\paragraph{Answer:}
%Mai: I will look into that over the weekend if I got results by then we can also show results for an additional model
We thank the reviewer for their comments and for acknowledging the organization and novelty of our approach. Regarding a comparison with the most recent graph generative models: the rebuttal period is short, and setting up and training a new model is time-consuming. In addition, graph generative models are much less developed than image generative models, with varying experimental setups and diverse code bases. For these reasons, we are unfortunately unable to provide results on additional models at this time.



