\section{Experiments}
\label{sec:exp}

In this section, we evaluate \method and compare it against the most competitive graph SSL methods on a variety of node classification datasets. We first introduce the evaluation setup, the datasets, and the baselines in Section~\ref{sec:settings}. We then present results on small- and medium-scale datasets in Section~\ref{sec:small} and on large-scale datasets in Section~\ref{large_eval}. Finally, we present the ablation study in Section~\ref{sec:ablation}.

% current state-of-the-art methods for graph SSL on a variety of popular node classification datasets. \Zhaozhuo{Road map }

\subsection{Settings}
\label{sec:settings}
% \vspace{-0.3cm}
\paragraph{Evaluation Setup} We take an untrained GNN encoder with randomly initialized parameters and train it, using \method or a baseline method, on the graph data $(\mathbf X, \mathbf A)$ without labels until convergence. We then freeze the parameters of the GNN and use it to encode the nodes into learned node representations $\mathbf{Z}$. Finally, we train a linear classifier (logistic regression) on the labelled training set with $\mathbf Z$ as input, and report the evaluation metrics on the unseen test set. Our evaluation setup is identical to that of previous works~\citep{velickovic2019deep, thakoor2021large}, which keeps the evaluation fair and consistent. Details about the testbed used for the evaluation are given in Section~\ref{sec:testbed}, and the hyper-parameters and training details are presented in Table 1 in the Supplemental Materials.
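
For concreteness, the linear evaluation step on a single-label dataset can be sketched as follows. The symbols $\mathbf W$, $\mathbf b$, $\mathcal V_{\mathrm{train}}$, and $C$ are notation introduced here for illustration only, and the sketch assumes the standard multinomial logistic regression formulation. Let $\mathbf z_v$ denote the row of the frozen representation matrix $\mathbf Z$ that belongs to node $v$, and let $y_v \in \{1, \dots, C\}$ denote its label. The classifier predicts
\[
\hat{\mathbf y}_v = \mathrm{softmax}\left(\mathbf W^\top \mathbf z_v + \mathbf b\right),
\]
and only $\mathbf W$ and $\mathbf b$ are fitted, by minimizing the cross-entropy over the labelled training nodes:
\[
\min_{\mathbf W, \mathbf b} \; -\sum_{v \in \mathcal V_{\mathrm{train}}} \log \hat{y}_{v, y_v},
\]
where $\hat{y}_{v, y_v}$ is the entry of $\hat{\mathbf y}_v$ for the true class. Because the encoder parameters do not appear in this objective, test performance reflects the quality of $\mathbf Z$ rather than the capacity of the classifier.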

% For all datasets, we only consider one of the most simple and common graph neural encoder Graph Convolutional Networks (GCN)~\citep{welling2016semi} as the GNN encoder for \method, in order to emphasize the advantage of our proposed learning method instead of focusing on tuning the architecture. The $i$th layer of a GCN follows the propagation rule $\mathbf X_{i} = \sigma(\hat{\mathbf A} \mathbf X_{i-1}\mathbf W_i)$, where $\hat{\mathbf A} = (\mathbf D + \mathbf I)^{-\frac 12} (\mathbf A+\mathbf I) (\mathbf D+\mathbf I)^{-\frac 12}$ is the symmetric normalized adjacency matrix with self-edges, $\mathbf X_0$ is the input node feature matrix, $\mathbf W_i$ is a learnable parameter matrix, and $\sigma(\cdot)$ is an activation function.

\paragraph{Datasets} We evaluate \method and the baselines on 6 datasets: 3 small-scale datasets (Cora, Citeseer, PubMed~\citep{sen2008collective}), 1 medium-scale dataset (ogbn-arxiv~\citep{hu2020open}), and 2 large-scale datasets (ogbn-proteins and ogbn-products~\citep{hu2020open}). The graph statistics are summarized in Table~\ref{tab:data_stat}.

\input{Sample UAI 2023 paper/tables/data_stat.tex}

\paragraph{Testbed}
\label{sec:testbed}
We implement our proposed method with the Deep Graph Library~\citep{wang2019deep}. Our experiments are conducted on a machine with 1 NVIDIA Tesla V100 32GB GPU, 2 24-core/48-thread Intel Xeon Gold 5220R CPUs, and 1.5TB of RAM.


% The datasets and their tasks are summarized as follows.
% \begin{inparaenum}[\itshape 1\upshape)]
% \item Cora, Citeseer, and PubMed~\citep{sen2008collective} are citation graphs and their task is classification of research papers into topics.
% \item ogbn-arxiv~\citep{hu2020open} is a citation network of computer science research papers and the task is classification into one of the 40 subject areas.
% \item ogbn-proteins~\citep{hu2020open} is a biological graph in which each node is a protein and each edge encodes biological association between proteins. There are 112 binary-classification tasks of protein functions, and the ROC-AUC score is reported.
% \item ogbn-products~\citep{hu2020open}  is a large-scale graph of Amazon product co-purchasing network, and the task is classification of products into one of the 47 categories.
% \end{inparaenum}.

% The baselines are compared on 6 different datasets, including Cora, Citeseer, Pubmed, ogbn-arxiv, ogbn-proteins and ogbn-products. The statistics of the datasets are summarized in table~\ref{tab:data_stat}. While learning the graph representations, the whole graph is used for the training. And the downstream classifier only uses the labels from training set. The training-validation-testing split follows the default setup of ``dgl'' package (``dgl.data'')~\citep{wang2019deep}.


\paragraph{Baselines} We perform a thorough comparison of \method against the current most competitive graph SSL methods: \begin{inparaenum}[\itshape 1\upshape)]
\item DGI~\citep{velickovic2019deep} proposes to learn node representations by maximizing the mutual information between node representations and the global representation through contrasting representations of a corrupted graph.
\item GATE~\citep{gate} reconstructs the input graph with an auto-encoder architecture that uses self-attention.
\item GRACE~\citep{grace} performs contrastive learning on positive and negative examples from two different corrupted graph views.
\item BGRL~\citep{thakoor2021large} learns contrastively from positive examples only by leveraging bootstrapping.
\item LaGraph~\citep{lagraph} learns through a reconstruction loss and an invariance loss between the representations of the original graph and a corrupted graph.
\item GraphMAE~\citep{graphmae} learns through reconstructing node features using two GNNs as encoder and decoder.
\item InfoGCL~\citep{infogcl} maximizes the agreement between the learned representations of two corrupted graph views encoded by a GNN and an MLP.
\item CCA-SSG~\citep{ccassg} maximizes the agreement between two corrupted graph views using a loss function inspired by Canonical Correlation Analysis. 
\end{inparaenum}
For reference, we also include the most common supervised baselines, which are trained with the labels of the training set as supervision.
\begin{inparaenum}[\itshape 1\upshape)]
\item MLP~\citep{hu2020open} is a multi-layer perceptron network with only the node features as input.
\item GCN~\citep{welling2016semi} propagates node information through convolutional layers.
\item GAT~\citep{gat} leverages self-attention to adaptively aggregate node information.
\item GraphSage~\citep{hamilton2017inductive} aggregates node feature information to generalize to unseen data.
\end{inparaenum}

% To perform a thorough evaluation of \method, our experiments compare it to other graph super-supervised learning methods, such as DGI, BGRL and GRACE.

% \daicomment{TODO: briefly summarize the alternative baselines + implementation of our method}

% \paragraph{Evaluation} To compare the effectiveness of different baselines, we evaluate performance of the learned representation in the downstream classification tasks. In detail, we freeze the representations learned from the baselines, and only train the classifier supervised by the labels. 





\subsection{Evaluation on Small and Medium-scale Graphs}
\label{sec:small}
Table~\ref{tab:acc_small} presents the accuracy of \method and the baselines on Cora, Citeseer, PubMed, and ogbn-arxiv. \method achieves state-of-the-art performance on all 4 datasets, improving on the previous best by $2.6\%$ on PubMed, $1.1\%$ on Citeseer, $0.2\%$ on Cora, and $0.33\%$ on ogbn-arxiv. Our method significantly exceeds the accuracy of supervised training on Cora, Citeseer, and PubMed, showing its potential to eliminate the reliance on labels in graph learning. On ogbn-arxiv, our method achieves accuracy competitive with the best supervised model, GAT (within $0.02\%$), while using a simpler model (GCN).


% \scriptsize\parbox[t]{2mm}{\multirow{3}{*}{\rotatebox[origin=c]{90}{Supervised}}}
% \scriptsize\parbox[t]{2mm}{\multirow{10}{*}{\rotatebox[origin=c]{90}{Self-supervised}}}

% Previous graph SSL methods such as GRACE requires memory budget that scales quadratically with the number of nodes in order compute the objective. Other methods like BGRL requires multiple times feed-forward of GNN during a single training iteration. Therefore, it prohibits the training of many baselines due to the limited GPU memory. In this section, we consider the datasets where the baselines can be trained directly over the whole graph. 

% Table~\ref{tab:acc_small} shows the classification accuracy of different baselines. We also \method to the widely used supervised learning methods such as GCN and GAT~\cite{velivckovic2017graph}. \method not just achieves the state-of-the-art accuracy in all the four dataset. It also significantly improves the classification accuracy even compared to the supervised learning methods. Especially for the pubmed dataset, \method improves the SOTA by $2.6\%$. Considering the fact that \method also enjoys the benefit of much smaller memory usage (save at least $2\times$ memory), \method demonstrate its advantage over the previous methods. 

% \vspace{-0.4cm}

\input{Sample UAI 2023 paper/tables/acc_large.tex}
\input{Sample UAI 2023 paper/tables/sensitivity.tex}

\subsection{Evaluation on Large-scale Graphs}
\label{large_eval}

We evaluate \method and the baselines on ogbn-proteins and ogbn-products, two challenging large-scale node classification datasets. Only \method is able to train on ogbn-proteins on a single GPU without sub-sampling; the other methods require sub-sampling to fit into 32 GB of GPU memory because they rely on multiple graph views (see Section~\ref{sec:mem}). We leverage the sub-sampling techniques described by~\citet{hamilton2017inductive, thakoor2021large} to scale the baselines. By training on the full graph without sub-sampling, \method achieves a $4.55\%$ improvement in AUC-ROC over the current best SSL method. More importantly, \method beats supervised training, with $5.29\%$ and $0.12\%$ higher AUC-ROC than supervised GCN and GraphSage, respectively. As a biological graph of protein interactions, ogbn-proteins is sensitive to graph corruptions. Our method achieves the best performance by avoiding corruptions and training on the full graph, which maximally preserves the semantics of the original graph.

The graph of ogbn-products is so large that even supervised training has to resort to sub-sampling. We therefore leverage Cluster-GCN~\citep{clustergcn} to scale \method efficiently by partitioning the graph into 100 clusters. Our method exceeds the previous best SSL method by $3.04\%$. Furthermore, it beats supervised training by $0.86\%$ when using the same GCN architecture, suggesting that our method has the potential to eliminate the need for costly labels in graph learning.

\subsection{Ablation Study}
\label{sec:ablation}

We study the sensitivity of our method to hyper-parameter changes. A robust SSL method should not be sensitive to hyper-parameters. This has been a weakness of prior SSL methods, which require vastly different corruption parameters for different datasets~\citep{you2021graph, thakoor2021large, ccassg}. We vary the hyper-parameters used to compute the heat kernel and PPR node proximity measures, and evaluate the test accuracy on Cora, Citeseer, and PubMed. As shown in Table~\ref{tab:sensitivity}, our method is not sensitive to the hyper-parameters of the node proximity scores: the accuracy drops by at most $1.3\%$ from the best accuracy achieved. Our method is therefore more robust than previous SSL approaches, which are sensitive to hyper-parameter changes.
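
For reference, the hyper-parameters in question can be made explicit under the closed forms commonly used in the graph diffusion literature. These forms are stated here as an assumption, and their normalization may differ from the definitions used elsewhere in the paper. With $\mathbf T = \mathbf D^{-\frac 12}\mathbf A\mathbf D^{-\frac 12}$ (an assumed choice of transition matrix, where $\mathbf D$ is the degree matrix), teleport probability $\alpha \in (0, 1)$, and diffusion time $t > 0$:
\[
\mathbf S^{\mathrm{PPR}} = \alpha \sum_{k=0}^{\infty} (1-\alpha)^k\, \mathbf T^k = \alpha \left(\mathbf I - (1-\alpha)\mathbf T\right)^{-1},
\qquad
\mathbf S^{\mathrm{heat}} = \sum_{k=0}^{\infty} e^{-t}\, \frac{t^k}{k!}\, \mathbf T^k = \exp\left(t\mathbf T - t\mathbf I\right).
\]
Under these forms, a larger $\alpha$ or a smaller $t$ concentrates the proximity mass on nearby nodes, while a smaller $\alpha$ or a larger $t$ spreads it over longer paths, so varying $\alpha$ and $t$ changes how local the proximity scores are.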

% ogbn-proteins consumes so much GPU memory that more than one view of the graph does not fit into 32 GB of GPU memory. Therefore, we have to resort to sub-sampling techniques proposed by~\cite{thakoor2022largescale} to scale the baselines to train on ogbn-proteins, since they rely on multiple graph views. In contrast, our method trains on ogbn-proteins without sub-sampling since it computes only a single view of the graph, with memory efficiency similar to supervised training. The advantage of our method is clear: $4.55\%$ better accuracy than the current best SSL method, and even $5.29\%$ and $0.12\%$ better accuracy than supervised training of GCN and GraphSage. The improvement in accuracy can also be explained by the fact that our method does not employ corruption techniques used by prior methods. Biological graphs can be especially sensitive to graph corruptions, since the semantics of the graph may change drastically for slight corruptions. 

% ogbn-products is a very large graph that even supervised training has to resort to sub-sampling. Therefore, we employ Cluster\method.

% Most of previous baselines requires the comparison between all the node representations. However, when the graphs size scales up, it becomes impossible to train over all the nodes. To adapt to the large graphs, BGRL proposes subsampling a batch of nodes during each iteration and only compare the node representations to themselves between these nodes (BGRL only build positive node pairs between two different node augmentations). Similarly, for other baselines such as GRACE, we also build the positive and negative node pairs only from a batch of subsampled nodes to address the scalability bottleneck. 

% \paragraph{Scalability} We evaluate the performance of \method on ogbn-proteins and ogbn-products datasets. The scale of both datasets is challenging even for the supervised training algorithms. Note that for the ogbn-proteins dataset, all the prior baselines run out of memory while training over the whole graph. On the contrary, \method costs much less memory (17.2 GB memory) and can still perform the whole-graph training. Since sub-sampling graph and batchwise training lead to slower convergence rate, therefore, when we need to implement a graph self-supervised algorithm over a graph of similar size, \method enjoys extra benefits of the faster training efficiency.


% For the ogbn-products dataset, most of most popular GPUs (like 32GB V100 or 40GB A100 GPU) cannot even load the graph to the memory. Thus, batchwise training is necessary for all the baselines. As described in section~\ref{sec:cluterGCN}, we incorporate the Cluster-GCN algorithm to our method to address the scalability issue. In the experiment, the number of clusters is set to 100 by default. 


% \paragraph{Accuracy} Table~\ref{tab:acc_large} shows that \method outperforms all the other super-supervised learning baselines. Especially for the ogbn-proteins dataset, \method boosts the classification accuracy by {$\bm{4.55\%}$}, which is a significant jump compared to the previous approaches, and is also the first time to achieve the similar performance compared to the supervised learning approach. Moreover, some alternative approaches such as GRACE is also vulnerable to the approximation of the target (due to the sub-sampling of graph)~\citep{thakoor2021large}, which again suggests the advantage of our method. 
