\section{Introduction}

The paper \cite{Han:2022} develops $\mathcal{G}$-Mixup, a data augmentation technique that mixes graph data using the theory of graphons. Past work on mixup and data augmentation has mostly focused on Euclidean data \cite{Zhang:2018} or on within-graph data augmentation \cite{Rong:2020,You:2020}. The proposed method instead allows graph data to be augmented across different classes of graphs. The authors of \cite{Han:2022} claim to provide theoretical and experimental evidence that the augmented data generated by $\mathcal{G}$-Mixup improves the generalization and robustness of graph neural networks (GNNs), compared both to training without data augmentation and to training with other data augmentation techniques.



\section{Scope of reproducibility}
\label{sec:claims}

In \cite{Han:2022}, the authors propose a new method for augmenting graph data, $\mathcal{G}$-Mixup, which they claim theoretically generates synthetic graphs that are mixtures of original graphs and improves the generalization and robustness of Graph Neural Networks (GNNs). In this report, we investigate the following two claims from \cite{Han:2022}:

\begin{itemize}
    \item Claim 1 (\textit{Theoretical mixing}): Theoretically, $\mathcal{G}$-Mixup produces synthetic graphs that have a mixture of the key graph topologies of source graphs coming from distinct classes.
    \item Claim 2 (\textit{Improved experimental performance of GNNs}): Augmenting graph data using $\mathcal{G}$-Mixup increases the test accuracy and decreases the test cross-entropy loss of a GNN when compared to (a) a GNN without augmentation, and (b) a GNN with a different graph augmentation algorithm.
    
    % \item Claim 3: Using $\mathcal{G}$-mixup augmented data improves the robustness of GNNs, specifically in the context of corrupting graph data by dropping edges.
\end{itemize}

In the original paper, \textbf{Claim 1} is supported by \cite[Theorem 4.2]{Han:2022}, \cite[Theorem 4.3]{Han:2022}, and proofs in \cite[Appendix A]{Han:2022}. In Sections~\ref{result 1} and \ref{appendix: theory}, we provide a more detailed proof of one of the authors' key lemmas, as well as a corrected statement of the second theorem.
\\

\textbf{Claim 2} is supported by the experiments in \cite[Section 5.3]{Han:2022}, whose results are reported in \cite[Table 2]{Han:2022} and \cite[Figure 4]{Han:2022}. In \Cref{section:gen-result}, we reproduce some of these experiments for two of the authors' original datasets, IMDB-B and REDDIT-B, and for a new dataset not used by the authors, PROTEINS. We also compare $\mathcal{G}$-Mixup with another graph augmentation algorithm (graphon-based edge perturbation) on IMDB-B, REDDIT-B, and PROTEINS.


\section{Methodology}

The authors provide their code publicly on GitHub.\footnote{The code from their GitHub repository is linked here: \href{https://github.com/ahxt/g-mixup}{https://github.com/ahxt/g-mixup}.} We reused the authors' code from GitHub in a Google Colab notebook with a GPU. We implemented the GCN architecture, added a hyperparameter to modify the graphon resolution, and added code to log the neural network statistics by epoch. Otherwise, we left the source code unchanged.


\subsection{Model descriptions}
The authors propose a new data augmentation method, $\mathcal{G}$-Mixup, for graph data that is irregular, not well aligned, and has divergent topology between graph classes, and to which existing mixup methods are therefore not directly applicable. The theory of $\mathcal{G}$-Mixup is based on the \textbf{graphon}, which is a continuous, bounded, and symmetric function from $[0,1]^{2}$ to $[0,1]$ and can be thought of as the weight matrix of a graph with an infinite number of vertices.
\\

Broadly, the $\mathcal{G}$-Mixup model proposed in the paper has the following key steps: i) estimate a graphon, $W_{\mathcal{G}}$, for each class of graphs $\mathcal{G}$; ii) mix up the graphons of different graph classes $\mathcal{G}$ and $\mathcal{H}$, as in (\ref{eqn:graphon-mixup}); and iii) generate synthetic graphs based on the mixed graphons and mix up the labels.
In (\ref{eqn:graphon-mixup}), $\lambda$ is called the \textbf{mixup ratio/parameter}. When $\lambda = 0$, the mixed graphon coincides with a single class graphon, and $\mathcal{G}$-Mixup becomes the graph data augmentation method \textbf{graphon-based edge perturbation} \cite{Hu:2021}.
\begin{equation}\label{eqn:graphon-mixup}
    W_{\mathcal{I}} = \lambda W_{\mathcal{G}} + (1-\lambda) W_{\mathcal{H}}
\end{equation}
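As an illustration of (\ref{eqn:graphon-mixup}), consider a hypothetical pair of step-function graphons with two blocks each; the block values below are made up for exposition and are not estimated from any dataset. Writing each graphon as the matrix of its block values and taking $\lambda = 0.25$,
\[
W_{\mathcal{G}} = \begin{pmatrix} 0.8 & 0.2 \\ 0.2 & 0.8 \end{pmatrix}, \qquad
W_{\mathcal{H}} = \begin{pmatrix} 0.1 & 0.6 \\ 0.6 & 0.1 \end{pmatrix},
\]
the mixed graphon is computed entrywise:
\[
W_{\mathcal{I}} = 0.25\, W_{\mathcal{G}} + 0.75\, W_{\mathcal{H}} = \begin{pmatrix} 0.275 & 0.5 \\ 0.5 & 0.275 \end{pmatrix}.
\]
Assuming, as is standard for mixup, that the labels are mixed with the same weights, the one-hot labels $y_{\mathcal{G}}=(1,0)$ and $y_{\mathcal{H}}=(0,1)$ give the soft label $y_{\mathcal{I}} = \lambda y_{\mathcal{G}} + (1-\lambda) y_{\mathcal{H}} = (0.25, 0.75)$.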




To implement the graphon estimation step, the paper uses step functions to approximate graphons. Step function estimation methods are well studied; they first align the vertices in a set of graphs based on degree and then estimate the step function from all the aligned adjacency matrices. The authors' code uses universal singular value thresholding (USVT) \cite{Chatterjee:2015} as the estimation method, and we also use this estimator in our experiments.
\\

The authors then mix up the estimated graphons as follows. For $n_{\text{aug}}$ cycles (this number is called the \textbf{augmentation number}), two graphons are randomly chosen from the set of estimated graphons and mixed as in (\ref{eqn:graphon-mixup}). Synthetic graphs are then generated from the mixed-up graphon. The number of synthetic graphs is controlled by the \textbf{augmentation ratio} $\alpha$, and the number of nodes of the synthetic graphs is controlled partially by the \textbf{graphon resolution}. These parameters are described in more detail in Section~\ref{appendix:hyperparamets}.
\\
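
For reference, the standard way to sample a graph on $n$ vertices from a graphon $W$ (the $W$-random graph $\mathbb{G}(n,W)$ of \cite{Lovasz:2006}) can be written as follows. We state it as a sketch of the generation step under the usual definition, not as a description of every detail of the authors' implementation:
\begin{align*}
    u_1,\dots,u_n &\overset{\text{iid}}{\sim} \mathrm{Unif}[0,1],\\
    A_{ij} &\sim \mathrm{Bernoulli}\big(W(u_i,u_j)\big) \quad \text{independently for } 1\leq i<j\leq n,
\end{align*}
with $A_{ji}=A_{ij}$ and $A_{ii}=0$. For a step-function graphon with $R$ equal-width blocks and block values $w_{ab}$ (here $R$ and $w_{ab}$ are illustrative symbols), we have $W(u_i,u_j)=w_{ab}$ whenever $u_i\in\left(\frac{a-1}{R},\frac{a}{R}\right]$ and $u_j\in\left(\frac{b-1}{R},\frac{b}{R}\right]$, so the edge probability depends only on the blocks into which the two latent positions fall.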

After obtaining the synthetic graph data, the paper uses the Graph Convolutional Network (GCN) \cite{Kipf:2017} and Graph Isomorphism Network (GIN) \cite{Xu:2019} neural networks for their experiments. See \cite[AF.2]{Han:2022} for the specific architectures of the GNNs used. For all of our experiments, we use the GCN model, as the paper's results using the GIN model do not provide statistically significant evidence to support their claims.


\subsection{Datasets}
%% our project
We tested the authors' experimental claims on the datasets IMDB-B, REDDIT-B, and PROTEINS, which are part of the TUDataset collection\footnote{The datasets can be downloaded from the TUDataset repository, linked here: \href{https://chrsmrrs.github.io/datasets/docs/datasets/}{https://chrsmrrs.github.io/datasets/docs/datasets/}.} \cite{Morris:2020}. The IMDB-B and REDDIT-B datasets are used in the original paper, while the PROTEINS dataset is not. The relevant statistics for the datasets are shown in Table~\ref{tab:data-stats}. The edges of the graphs are unlabeled, and labels on vertices are ignored in our code.
%% link to datasets
% The datasets can be downloaded from the TUDataset repository, linked here \href{https://chrsmrrs.github.io/datasets/docs/datasets/}{https://chrsmrrs.github.io/datasets/docs/datasets/}.

%% data statistics
\begin{table}[h]
    \centering
    \begin{tabular}{c|ccc}
        Dataset & PROTEINS & REDD-B & IMDB-B \\
        \hline
        \# graphs & 1113 & 2000 & 1000\\
        \# classes & 2 & 2 & 2\\
        class priors & $\approx$0.596/$\approx$0.404 & 0.5/0.5 & 0.5/0.5 \\
        avg. \# vertices & $\approx$39.058 & 429.627 & 19.77\\
        avg. \# edges & $\approx$72.816 & 497.794 & 96.53\\
        avg. density & $\approx$0.0477 & $\approx$0.0027 & 0.2469
    \end{tabular}
    \caption{Statistics for the PROTEINS, REDDIT-B (REDD-B), and IMDB-B datasets.}
    \label{tab:data-stats}
\end{table}
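
As a worked check of Table~\ref{tab:data-stats}, the density row is numerically consistent with dividing the average number of edges by the square of the average number of vertices. This is an observation about the reported numbers rather than a formal definition of density. For IMDB-B,
\[
\frac{96.53}{19.77^2} = \frac{96.53}{390.85} \approx 0.247,
\]
and similarly $72.816/39.058^2 \approx 0.0477$ for PROTEINS and $497.794/429.627^2 \approx 0.0027$ for REDDIT-B.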

%% train/test/validation splits
\vspace{1em}

We used a 7:1:2 train-validation-test split, which is the same split used by the authors. The original authors' code does all the preprocessing necessary for the datasets through the PyTorch Geometric library.

\subsection{Hyperparameters}\label{subsec:hyperparameters}

In this subsection, we state the values used for the hyperparameters related to $\mathcal{G}$-Mixup. We describe these hyperparameters further in Section~\ref{appendix:hyperparamets}. We set the augmentation ratio to $\alpha=0.2$, the mixup ratio interval to $\lambda\in [\Lambda_1,\Lambda_2]=[0.1,0.2]$, the augmentation number to $n_{\text{aug}}=10$, and the graphon resolution to the median number of nodes in the training set. As an additional experiment, we also perform hyperparameter searches for the mixup ratio and the graphon resolution. The results are described in Section~\ref{section:hyperparam}.
\\

We now also state the hyperparameters for the GCN model architecture used by the original paper. We use the same values. There are $64$ hidden features, the activation function is ReLU, the batch size is $128$, the initial learning rate is $0.01$ and drops by half every $100$ epochs, and there are $4$ layers.
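Under a step-decay reading of this schedule (an assumption about how the halving is applied, with epochs indexed from $t=0$), the learning rate at epoch $t$ is
\[
\eta_t = 0.01\cdot 2^{-\lfloor t/100\rfloor},
\]
so that $\eta_0=0.01$, $\eta_{100}=0.005$, $\eta_{200}=0.0025$, and $\eta_{499}=0.01/16=6.25\times 10^{-4}$.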

\subsection{Experimental setup and code}\label{subsec:experimental setup}

We used the authors' code on GitHub, and used Google Colab to perform our experiments.
The authors only had the GIN neural network implemented in their code, so we implemented the GCN neural network per their specifications. For IMDB-B we train for 100 epochs, for PROTEINS we train for 300 epochs, and for REDDIT-B we train for 500 epochs.\footnote{We determined the number of epochs by looking at output graphs for 1000 epochs, and cutting off training when the validation loss started to increase consistently.} The epoch used for testing is selected on the validation set using the \textbf{mixup cross-entropy loss} defined below, and we report the test accuracy over $10$ runs.
\\

We use the above experimental setup to compare the baseline GCN model with the GCN model using $\mathcal{G}$-Mixup for the IMDB-B, REDDIT-B, and PROTEINS datasets (Sections~\ref{section:gen-result} and \ref{section:proteins}). In addition, we use this setup to do hyperparameter searches for the mixup ratio and graphon resolution (Section~\ref{section:hyperparam}); the search for the mixup ratio also gives a comparison of $\mathcal{G}$-Mixup with another graph data augmentation method.
\\

As the label of a mixed-up graphon is a probability distribution over classes, we define the \textbf{mixup cross-entropy loss function} as in Equation~\ref{eqn:cross entropy}, where $p$ is the predicted probability distribution and $q$ is the target probability distribution.
\begin{equation}\mathcal{L}_{\mathrm{mix}}(p,q) = -\sum_i q_i \log(p_i)\label{eqn:cross entropy}\end{equation}
Note that this definition subsumes that of the usual cross-entropy loss.\footnote{This definition was first given here: \href{https://github.com/moskomule/mixup.pytorch}{https://github.com/moskomule/mixup.pytorch}.} We also note that the original paper does not explicitly state how cross-entropy loss is defined, but the authors' code uses this definition.
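
To illustrate Equation~\ref{eqn:cross entropy}, consider two classes and a synthetic graph whose target is the soft label $q=(\lambda,1-\lambda)$, assuming that labels are mixed with the same ratio $\lambda$ as the graphons. The loss then reads
\[
-\lambda\log(p_1) - (1-\lambda)\log(p_2),
\]
which is a convex combination of the two ordinary cross-entropy terms. For a one-hot target such as $q=(1,0)$, it reduces to $-\log(p_1)$, the standard cross-entropy loss. As a numerical example with made-up values, $\lambda=0.2$ and $p=(0.3,0.7)$ give, using the natural logarithm,
\[
-0.2\log(0.3) - 0.8\log(0.7) \approx 0.2\cdot 1.204 + 0.8\cdot 0.357 \approx 0.526.
\]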

\subsection{Computational requirements}

For the PROTEINS and IMDB-B datasets, we use the free tier of Google Colab, which gives a Tesla T4 GPU with an average of $12$ GB RAM. It takes around $11$ minutes to run $10$ seeds of $500$ epochs of $\mathcal{G}$-Mixup on the GCN model.
\\

For the REDDIT-B dataset, we used the campus computing cluster, which gives a Tesla V100 GPU with up to $90$ GB RAM. It takes around $20$ minutes to run $10$ seeds with $500$ epochs with or without $\mathcal{G}$-Mixup.
\\

In terms of software, we use PyTorch Geometric $2.1.0$ and PyTorch $1.12.1$ on Python $3.7$. At the time of writing, these are the defaults of Google Colab.


\section{Results}
\label{sec:results}

Our main results are summarized as follows.

\begin{itemize}
    \item We provide a more detailed proof of a key lemma of the paper and correct a constant in one of its theorems; these results examine whether the mixed-up graphon and the synthetic graphs sampled from it preserve the ``key graph topologies'' of the original graphs.
    \item We find that $\mathcal{G}$-Mixup gives higher classification accuracy and lower test loss than the baseline GCN model for the REDDIT-B and PROTEINS datasets. For the IMDB-B dataset, we find that $\mathcal{G}$-Mixup gives lower test loss but not higher classification accuracy than the baseline GCN model. However, our results are not statistically significant. Our values for classification accuracy for the REDDIT-B and IMDB-B datasets are within about $6\%$ and $2\%$ respectively of the original paper's values.
    \item We find that varying the mixup ratio and graphon resolution hyperparameters does not greatly affect classification accuracy of $\mathcal{G}$-Mixup for the three datasets. Finally, we find that $\mathcal{G}$-Mixup does not outperform another graph data augmentation method (graphon-based edge perturbation) as measured by classification accuracy for the datasets.
\end{itemize}


\subsection{Results reproducing original paper}



\subsubsection{Result 1}\label{result 1}
We discuss the paper's theorems supporting Claim 1 of Section~\ref{sec:claims}, which is that the synthetic graph generated by $\mathcal{G}$-Mixup is a mixture of original graphs. In particular, we give a more detailed proof of \cite[Lemma A.2]{Han:2022} and correct their statement of \cite[Theorem 4.3]{Han:2022}. The original paper uses the notion of \textbf{discriminative motifs} to capture key graph topologies. The relevant definitions are provided in Section~\ref{appendix: theory}.
\\

In Section~\ref{appendix: theory}, we give a detailed proof of the following result, which is the key lemma used to prove Theorem~\ref{thm:4.2}. We note that the paper's original proof of this result is less than ten lines long.

\begin{lemma}[\cite{Han:2022}, Lemma A.2]\label{lem:A.2}
Let $F$ be a simple graph and let $W,W'$ be two graphons. Let $e(F)$ denote the number of edges of $F$. Then,
\[|t(F,W)-t(F,W')|\leq e(F)\lVert W-W'\rVert_{\square}.\]
\end{lemma}
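
As a quick sanity check of Lemma~\ref{lem:A.2} (using the standard definitions of homomorphism density and cut norm from \cite{Lovasz:2006}), take $F=K_2$ to be a single edge, so that $e(F)=1$ and $t(K_2,W)=\int_{[0,1]^2}W(x,y)\,dx\,dy$. Since the cut norm is $\lVert U\rVert_\square=\sup_{S,T\subseteq[0,1]}\left|\int_{S\times T}U(x,y)\,dx\,dy\right|$, choosing $S=T=[0,1]$ gives
\[
|t(K_2,W)-t(K_2,W')| = \left|\int_{[0,1]^2}\big(W(x,y)-W'(x,y)\big)\,dx\,dy\right| \leq \lVert W-W'\rVert_\square,
\]
which is the lemma in this simplest case.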



In the paper, Lemma~\ref{lem:A.2} is applied to prove the following main result. It says that, for a discriminative motif of a graph class, the difference between its homomorphism density in the graphon estimating that class and its homomorphism density in the mixed-up graphon is bounded above by a multiple of the cut norm of the difference between the graphons estimating the two distinct graph classes. This suggests that the mixed-up graphon indeed retains graph topologies of both classes of graphs being mixed.

\begin{theorem}[\cite{Han:2022}, Theorem 4.2]\label{thm:4.2}
Let $\mathcal{G},\mathcal{H}$ be two sets of graphs with corresponding graphons $W_{\mathcal{G}},W_{\mathcal{H}}$ and corresponding discriminative motif sets $\mathcal{F}_{\mathcal{G}},\mathcal{F}_{\mathcal{H}}$. Let $\lambda\in (0,1)$, and let $W_{\mathcal{I}}=\lambda W_{\mathcal{G}}+ (1-\lambda)W_{\mathcal{H}}$ be the mixed graphon. Then, for any $F_{\mathcal{G}}\in\mathcal{F}_{\mathcal{G}}$ and $F_{\mathcal{H}}\in\mathcal{F}_{\mathcal{H}}$,
\begin{align*}
    |t(F_\mathcal{G},W_{\mathcal{I}})-t(F_\mathcal{G},W_{\mathcal{G}})|&\leq (1-\lambda) e(F_{\mathcal{G}})\lVert W_{\mathcal{H}}-W_{\mathcal{G}}\rVert_\square,\\
    |t(F_{\mathcal{H}},W_{\mathcal{I}})-t(F_\mathcal{H},W_{\mathcal{H}})|&\leq\lambda e(F_\mathcal{H})\lVert W_\mathcal{H}-W_\mathcal{G}\rVert_\square.
\end{align*}
\end{theorem}
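
The step from Lemma~\ref{lem:A.2} to Theorem~\ref{thm:4.2} is short and can be sketched as follows. By the definition of the mixed graphon,
\[
W_{\mathcal{I}}-W_{\mathcal{G}} = \lambda W_{\mathcal{G}}+(1-\lambda)W_{\mathcal{H}}-W_{\mathcal{G}} = (1-\lambda)(W_{\mathcal{H}}-W_{\mathcal{G}}).
\]
Applying Lemma~\ref{lem:A.2} with $F=F_{\mathcal{G}}$, $W=W_{\mathcal{I}}$, and $W'=W_{\mathcal{G}}$, and using that the cut norm is absolutely homogeneous, gives
\[
|t(F_{\mathcal{G}},W_{\mathcal{I}})-t(F_{\mathcal{G}},W_{\mathcal{G}})| \leq e(F_{\mathcal{G}})\lVert W_{\mathcal{I}}-W_{\mathcal{G}}\rVert_\square = (1-\lambda)\,e(F_{\mathcal{G}})\lVert W_{\mathcal{H}}-W_{\mathcal{G}}\rVert_\square.
\]
The second inequality follows in the same way from $W_{\mathcal{I}}-W_{\mathcal{H}}=\lambda(W_{\mathcal{G}}-W_{\mathcal{H}})$.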



The authors then state \cite[Theorem 4.3]{Han:2022}.
The purpose of this result is to show that a random graph sampled from a mixed-up graphon has, with high probability, a graph topology close to that of the mixed-up graphon, provided that the number of vertices of the random graph is sufficiently large. However, the proof given in the paper is incorrect, as it misquotes the following result of \cite{Lovasz:2006}.

\begin{lemma}[\cite{Lovasz:2006}, Theorem 2.5]\label{lem:L-S thm 2.5}
Let $W$ be a graphon, let $n\geq 1$, and let $0<\varepsilon<1$. Let $F$ be a simple graph. Then, the $W$-random graph $\mathbb{G}=\mathbb{G}(n,W)$ satisfies
\[\Pr(|t(F,\mathbb{G})-t(F,W)|>\varepsilon)\leq 2\exp\left(-\frac{\varepsilon^2n}{18 v(F)^2}\right).\]
\end{lemma}

The authors of \cite{Han:2022} have an $8$ instead of an $18$ in the denominator of the fraction on the right-hand side of the inequality. Therefore, the correct statement of \cite[Theorem 4.3]{Han:2022} is the following.

\begin{theorem}[\cite{Han:2022}, corrected Theorem 4.3]\label{thm:4.3}
Let $W_{\mathcal{I}}$ be the mixed graphon, let $n\geq 1$, and let $0 < \varepsilon < 1$. Let $F_{\mathcal{I}}$ be the mixed discriminative motif. Then the $W_{\mathcal{I}}$-random graph $\mathbb{G}=\mathbb{G}(n,W_{\mathcal{I}})$ satisfies
\[\Pr(|t(F_{\mathcal{I}},\mathbb{G})-t(F_{\mathcal{I}},W_{\mathcal{I}})|>\varepsilon)\leq 2\exp\left(-\frac{\varepsilon^2n}{18 v(F_\mathcal{I})^2}\right).\]
\end{theorem}

\begin{proof}
We apply Lemma~\ref{lem:L-S thm 2.5}, with $F=F_{\mathcal{I}},\, W=W_{\mathcal{I}}.$
The corrected theorem statement then follows.
\end{proof}
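
To get a feel for the bound in Theorem~\ref{thm:4.3}, one can solve for the number of vertices $n$ that makes the right-hand side at most a target failure probability $\delta\in(0,1)$ (an illustrative symbol not used in the original paper):
\[
2\exp\left(-\frac{\varepsilon^2 n}{18\, v(F_{\mathcal{I}})^2}\right)\leq\delta
\quad\Longleftrightarrow\quad
n\geq\frac{18\, v(F_{\mathcal{I}})^2\log(2/\delta)}{\varepsilon^2}.
\]
For example, for a hypothetical triangle motif with $v(F_{\mathcal{I}})=3$, $\varepsilon=0.1$, and $\delta=0.05$, this requires
\[
n\geq\frac{18\cdot 9\cdot\log(40)}{0.01}\approx\frac{162\cdot 3.689}{0.01}\approx 5.98\times 10^{4}.
\]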



\subsubsection{Result 2}\label{section:gen-result}

We now provide our experimental results comparing $\mathcal{G}$-Mixup's performance on the GCN architecture with a vanilla GCN network on the IMDB-B and REDDIT-B datasets. This experiment is related to Claim 2(a) of Section~\ref{sec:claims}, which is that augmenting graph data using $\mathcal{G}$-Mixup increases classification accuracy and decreases test loss of GNNs.
\\

Table~\ref{tab:our-table-2} shows the classification accuracy results for our experiments. The authors' original results for this experiment, which use the entire REDDIT-B dataset, are shown in Table~\ref{tab:table-2-copy}. Our results for the REDDIT-B dataset are within about $6\%$ of the authors' reported values, and our results for the IMDB-B dataset are within about $2\%$ of the authors' reported values.

\begin{table}[h]
    \centering
    \begin{tabular}{c|ccc}
        Method & PROTEINS & REDDIT-B & IMDB-B\\
        \hline
        vanilla & $58.52 \pm 3.11$ & $81.4 \pm 5.49$ & $\mathbf{73.15 \pm 2.5}$ \\
        w/ $\mathcal{G}$-Mixup & $\mathbf{64.66 \pm 5.06}$ & $\mathbf{84.8 \pm 4.75}$ & $71.3 \pm 3.3$
    \end{tabular}
    \caption{Our performance comparisons of $\mathcal{G}$-Mixup using the GCN architecture. The metric is classification accuracy.}
    \label{tab:our-table-2}
\end{table}

\begin{table}[h]
    \centering
    \begin{tabular}{c|cc}
        Method & REDDIT-B & IMDB-B \\
        \hline
        vanilla & $78.82 \pm 1.33$ & $72.18 \pm 1.55$ \\
        w/ $\mathcal{G}$-Mixup & $\mathbf{89.81 \pm 1.70}$ & $\mathbf{72.87 \pm 3.85}$ \\
    \end{tabular}
    \caption{Original performance comparisons of $\mathcal{G}$-Mixup using the GCN architecture. The metric is classification accuracy. This table is part of \cite[Table 2]{Han:2022}.}
    \label{tab:table-2-copy}
\end{table}

\vspace{1em}


Figure~\ref{fig:redbintraintestvallosscurves} in Section~\ref{appendix:loss figures} compares the loss curves for our experiment on the REDDIT-B dataset, and Figure~\ref{fig:imdbbintraintestvallosscurves} compares the loss curves for the IMDB-B dataset. Each line depicts the mean loss over our ten runs, and the shading shows the standard deviation. It appears that $\mathcal{G}$-Mixup performs better in terms of test loss for both datasets. However, the standard deviation bands overlap. The authors' original results for the REDDIT-B and IMDB-B datasets are shown in Figure~\ref{fig:paper_loss_fig}. We note that the authors do not explicitly state what the lines or shadings of their figures represent.



\subsection{Results beyond original paper}

\subsubsection{Additional Result 1}\label{section:proteins}

Using the hyperparameters specified in Section~\ref{subsec:hyperparameters} on the PROTEINS dataset, we tested the performance of the GCN model with and without $\mathcal{G}$-Mixup. This experiment is related to Claim 2(a) of Section~\ref{sec:claims}. We chose this dataset because we wanted to analyze the performance of $\mathcal{G}$-Mixup on an additional full dataset, and this dataset was small enough in terms of memory for us to implement our experiment.
\\

Table~\ref{tab:our-table-2} shows the classification accuracy results for our experiment in the first column. As in our experiment on the REDDIT-B dataset, the GCN model with $\mathcal{G}$-Mixup achieves higher classification accuracy than the baseline GCN model.
\\

Figure~\ref{fig:proteinstraintestvallosscurves} in Section~\ref{appendix:loss figures} compares the loss for our experiments on the training, validation, and test splits of the PROTEINS dataset. It appears that $\mathcal{G}$-Mixup with the GCN architecture performs better in terms of loss. In fact, the vanilla GCN network does not seem to converge at all, as the classification accuracy for the training set does not appear to improve.

\subsubsection{Additional Result 2}\label{section:hyperparam}

We did hyperparameter searches for the mixup ratio and graphon resolution on the three datasets. We chose to study these hyperparameters because according to \Cref{thm:4.2} and \Cref{thm:4.3}, the mixup ratio $\lambda$ and graphon resolution (which upper bounds the number of nodes for a synthetic graph) affect the probability that a synthetic graph generated from the mixed-up graphon contains a mixed-up version of original graph data's discriminative motifs.
\\

This experiment is also related to Claim 2(b): by setting the mixup ratio $\lambda=0$, $\mathcal{G}$-Mixup degenerates into another graph data augmentation method, \textbf{graphon-based edge perturbation} \cite{Hu:2021}. Thus, the hyperparameter search for the mixup ratio also allows us to compare the performance of $\mathcal{G}$-Mixup with a ``simpler'' graph data augmentation method.
\\

Our results for the mixup ratio and graphon resolution are shown in Tables~\ref{tab:lam-search} and \ref{tab:res-search} respectively. Overall, varying either hyperparameter does not appear to lead to significant changes in the classification accuracy of $\mathcal{G}$-Mixup, with the exception of REDDIT-B at $\lambda=0.5$, where the accuracy drops to $74.78 \pm 2.72$.



{\small \setlength\tabcolsep{1.5pt} \begin{table}[h]
    \centering
    \begin{adjustwidth}{-1em}{-1em}
    \begin{tabular}{c|c|c|c|c|c|c}
        $\lambda$ & 0 & 0.001 & 0.01 & 0.1 & 0.5 & No mixup \\
        \hline
        PROTEINS & $62.69\pm 4.1$ & $62.69\pm 4.8$ & $62.91\pm 4.99$ & $61.84\pm 3.53$ & $63.36\pm 4.98$ & $58.52 \pm 3.11$ \\
        IMDB-B & $72.0 \pm 3.61$ & $73.55 \pm 2.24$ & $72.64 \pm 2.75$ & $73.8 \pm 3.14$ & $72.39 \pm 2.61$ &  $73.15 \pm 2.5$ \\
        REDDIT-B & $89.78 \pm 1.16$ & $89.20 \pm 1.48$ & $88.75 \pm 3.29$ & $87.82 \pm 3.87$ & $74.78 \pm 2.72$ & $81.4 \pm 5.49$
    \end{tabular}
    \caption{Average accuracy $\pm$ standard deviation over 10 seeds for varying mixup ratios $\lambda$. The last column uses the baseline GCN model with no $\mathcal{G}$-Mixup.}
    \label{tab:lam-search}
    \end{adjustwidth}
\end{table}}


{\small \setlength\tabcolsep{1.5pt} \begin{table}[h]
    \centering
    \begin{adjustwidth}{-5em}{-5em}
    \begin{tabular}{c|c|c|c|c|c|c|c|c}
        $r$ & $-15$ & $-10$ & $-5$ & 0 & 5 & 10 & 15 & No mixup \\
        \hline
        PROTEINS & $61.84\pm3.55$ & $63.86\pm 4.12$ & $63.41\pm3.21$ & $62.47\pm5.09$ & $62.69\pm3.72$ & $62.96\pm2.98$ & $63.81\pm3.32$ & $58.52 \pm 3.11$ \\
        IMDB-B & $73.34 \pm 2.26$ & $74.05 \pm 1.98$ & $73.14 \pm 2.49$ & $73.5 \pm 2.99$ & $73.5 \pm 3.59$ & $73.8 \pm 2.47$ & $73.75 \pm 2.74$ & $73.15 \pm 2.5$ \\
        REDDIT-B &  $83.15 \pm 5.32$ & $84.73 \pm 4.78$ & $82.78 \pm 6.51$ & $84.8 \pm 4.75$ & $84.9 \pm 5.14$ & $81.58 \pm 5.31$ & $85.03 \pm 5.41$ & $81.4 \pm 5.49$
    \end{tabular}
    \caption{Average accuracy $\pm$ standard deviation over 10 seeds for varying graphon resolutions given by $(\text{median number of nodes in training graphs})+r$. The last column is using the baseline GCN model with no $\mathcal{G}$-Mixup.}
    \label{tab:res-search}
    \end{adjustwidth}
\end{table}}

\section{Discussion}\label{sec:discussion}

\subsubsection*{Verification of Claims}
We summarize whether or not our results support the original paper's claims.
\begin{itemize}
    \item Result 1 supports the authors' Claim 1, which is their theoretic claim that $\mathcal{G}$-Mixup produces synthetic graphs that contain a mixture of key graph topologies of source graphs.
    \item In Result 2, we are able to closely replicate the original paper's values for classification accuracy for the REDDIT-B and IMDB-B datasets. In addition, our results for the REDDIT-B and PROTEINS datasets from Result 2 and Additional Result 1 support Claim 2(a), which is the authors' claim that $\mathcal{G}$-Mixup leads to better performance than a vanilla GCN model. However, our results are not statistically significant, and we are unsure why the original paper's standard deviations are much lower than ours under the same experimental setup of 10 random seeds. In addition, our classification results for the IMDB-B dataset do not support this claim, as we found that the vanilla GCN model gives better classification accuracy than the GCN model with $\mathcal{G}$-Mixup.
    \item In Additional Result 2, we found that varying the mixup ratio or the graphon resolution of the $\mathcal{G}$-Mixup method does not greatly affect the classification accuracy. Furthermore, our results for the mixup ratio experiment do not support Claim 2(b), which is the authors' claim that $\mathcal{G}$-Mixup leads to better performance than other graph data augmentation methods.
\end{itemize}

\subsubsection*{Overall Conclusion} Although some of our results seem to indicate that $\mathcal{G}$-Mixup is a useful graph data augmentation method, we cannot provide statistically significant evidence of the original paper's experimental claims. Our verification of the authors' theoretical results supports their theoretical claim. However, we feel that some of the assumptions used for the theoretical claim are too strong for it to carry over to $\mathcal{G}$-Mixup in practice. For example, it is assumed that every class in a graph dataset can be effectively estimated by a graphon, and that discriminative motifs exist and are an effective way to measure key graph topologies. In addition, our results from the hyperparameter searches suggest that it may not be the mixup of graph class topologies that leads to better performance, but rather that $\mathcal{G}$-Mixup injects noise into the data, which has also been shown to lead to better performance \cite{Grandvalet:1997}.
\\
% We also ran experiments on the full PROTEINS dataset, which extends the paper's results. We note that the vanilla GCN model does not converge on this dataset. One possibility for this is that the PROTEINS dataset has three distinct ``features," whereas the REDDIT-BINARY and other datasets tested in the original paper have none. It is also possible that this is due to an imbalance in the priors of the PROTEINS dataset: around $60\%$ of the graphs are of one class, and $40\%$ of the other class. We plan to investigate this further.

\textbf{Strengths}: Our proofs of the paper's theoretical results are more detailed than those in the original paper. We were able to essentially reproduce the results of the authors' experiments.

\textbf{Weaknesses}: The main weaknesses of our experimental results relate to relatively high standard deviations for all of our results, which affected our analysis. Given more time, it would be ideal to run our experiments using more seeds.


\subsection{What was easy}
%% our writing
Although the authors' proofs of the theoretical claims were sparse, they provided many citations that helped us fill in the gaps easily. We also found these citations helpful for understanding the theory of graphons in general. The authors' code makes it easy to run the experiments on other datasets in the PyTorch Geometric library.

\subsection{What was difficult}\label{subsec:difficult}
%% our writing
We initially had issues running the $\mathcal{G}$-Mixup model for the full REDDIT-B dataset on Google Colab because we ran out of memory. We believe that the authors' code is not well optimized in this regard. Since most of the memory is consumed in generating the graphons, and the graphon construction algorithm is deterministic, it would have been easier to verify the authors' results if they had provided precomputed graphons for some predetermined training data. It was also difficult to install the correct Python packages on our campus computing cluster.

\subsection{Communication with original authors}

We exchanged emails with the corresponding author, who offered suggestions on how to address our memory issues with the REDDIT-B dataset. However, by the time of writing of this report, we had either been unable to implement their suggestions or found that they did not resolve our issues.
