\section{Graph Self-Supervised Learning}\label{sec:graph_ssl}
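In this section, we first formalize the graph self-supervised learning (SSL) problem studied in this paper. We then identify two limitations of existing graph SSL methods: their reliance on graph corruption and their need to compute multiple views of the graph.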
\subsection{Problem Formulation}\label{sec:prob_form}

Graph SSL aims to learn a GNN-based encoder that produces high-quality representations for graph data without using labels. We follow the standard problem setup of graph SSL~\citep{velickovic2019deep, grace, thakoor2021large} to keep the training and evaluation procedures consistent with prior approaches. During the training stage, we have access to graph data $(\mathbf X, \mathbf A)$ for training a GNN-based encoder $\mathcal E$, where $\mathbf X \in \mathbb R^{n \times k}$ is the node feature matrix and $\mathbf A \in \mathbb R^{n \times n}$ is the adjacency matrix. We denote each row of $\mathbf X$ as $\mathbf x_i$, which 
% Each row vector of $\mathbf X$, $\mathbf x_i \text{ where } 1\le i\le n$,
corresponds to the $k$-dimensional feature vector of node $i$, where $i\in [n]$. The encoder $\mathcal E: \mathbb R^{n \times k} \times \mathbb R^{n \times n} \rightarrow \mathbb R^{n \times d}$ maps the graph $(\mathbf X, \mathbf A)$ to a node representation matrix $\mathbf Z \in \mathbb R^{n \times d}$, and its parameters are updated by optimizing an SSL objective. Each row of $\mathbf Z$, denoted as 
$\mathbf z_i$, corresponds to the 
% in which each row vector $\mathbf z_i, 1 \le i \le n$ corresponds to the
$d$-dimensional representation of node $i$.
We evaluate graph SSL methods by freezing the learned node representation matrix $\mathbf Z$ and then training and testing a linear classifier on top of it for downstream tasks.
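
One common way to instantiate this evaluation protocol can be sketched as follows. The symbols $\mathbf W$, $\mathbf b$, $C$, $y_i$, and $\mathcal V_{\mathrm{train}}$ are introduced here for illustration only, and the exact classifier and regularization may differ across prior work. With $\mathbf Z$ held fixed, a softmax (multinomial logistic regression) classifier over $C$ classes is fit on the labeled training nodes $\mathcal V_{\mathrm{train}}$ by minimizing the cross-entropy
\[
\min_{\mathbf W, \mathbf b} \; -\frac{1}{|\mathcal V_{\mathrm{train}}|} \sum_{i \in \mathcal V_{\mathrm{train}}} \log \frac{\exp\left(\mathbf w_{y_i}^\top \mathbf z_i + b_{y_i}\right)}{\sum_{c=1}^{C} \exp\left(\mathbf w_c^\top \mathbf z_i + b_c\right)},
\]
where $\mathbf w_c$ denotes the $c$-th row of $\mathbf W \in \mathbb R^{C \times d}$, $b_c$ the $c$-th entry of $\mathbf b \in \mathbb R^{C}$, and $y_i \in [C]$ the label of node $i$. No gradient is propagated to the encoder $\mathcal E$, so the reported test performance reflects the quality of $\mathbf Z$ rather than the capacity of the classifier.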

% Specifically, we take the use each node representation $\mathbf z_i$, where $i\in [n]$, as the new node feature and train a linear models on top of it. We report the  performance metrics of 
% freeze the learned node representations $\mathbf Z$ and train a simple linear classifier (logistic regression) on top of it with the training set of the downstream task. We report the performance metrics of the linear classifier with $\mathbf Z$ on the unseen test set of the downstream task.

% Our goal is to train a GNN-based encoder $\mathcal E: \mathbb R^{n \times k} \times \mathbb R^{n \times n} \rightarrow \mathbb R^{n \times d}$ that encodes the graph $(\mathbf X, \mathbf A)$ into encoded node representations denoted by $\mathbf Z \in \mathbb R^{n \times d}$, in which each row vector $z_i, 1 \le i \le n$, corresponds to a $d$-dimensional real-valued encoded representation of node $i$. The quality of the encoded representations $\mathbf Z$ is evaluated with frozen linear evaluation on downstream tasks: with $\mathbf Z$ fixed, we train a linear classifier with $\mathbf Z$ as input on the training set of the downstream task, and report the performance metrics of the linear classifier on the unseen test set.
% \Zhaozhuo{SSL workflow, input output}

\input{Sample UAI 2023 paper/figs/fig1.tex}


\subsection{Graph Corruption Techniques}

Prior competitive graph SSL methods rely on corrupting the input graph to generate positive and negative examples for learning. Graph corruption techniques perturb node attributes or the adjacency matrix to produce alternative graph views~\citep{grace}. In this way, the GNN-based encoder (see Section~\ref{sec:prob_form}) can learn to produce invariant representations. Popular graph corruption techniques include node feature masking~\citep{grace}, node feature shuffling~\citep{velickovic2019deep}, node dropping~\citep{you2020graph}, edge dropping~\citep{grace}, and subgraphing~\citep{you2020graph}. For example, CCA-SSG~\citep{ccassg} and BGRL~\citep{thakoor2021large} employ node feature masking and edge dropping to generate graph views and maximize the agreement between those views, DGI~\citep{velickovic2019deep} uses node feature shuffling to produce negative examples, and GRACE~\citep{grace} uses node feature masking and edge removal for generating inter-view positives, inter-view negatives and intra-view negatives for contrastive learning. The graph corruption techniques are directly inspired by data augmentation methods from the computer vision domain, such as random erasing~\citep{randomerasing} and cropping~\citep{imageaugment}.
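
To make these corruptions concrete, three of them can be written in the following simplified form. The notation $p_f$, $p_e$, $\mathbf m$, $\mathbf M$, and $\mathbf P$ is introduced here for illustration, and individual methods may differ in details such as how symmetry of the adjacency matrix is handled. Node feature masking zeroes out feature dimensions with probability $p_f$,
\[
\tilde{\mathbf X} = \mathbf X \odot \left(\mathbf 1_n \mathbf m^\top\right), \qquad m_j \sim \mathrm{Bernoulli}(1 - p_f), \quad j \in [k],
\]
edge dropping removes each existing edge with probability $p_e$,
\[
\tilde{\mathbf A} = \mathbf A \odot \mathbf M, \qquad M_{ij} \sim \mathrm{Bernoulli}(1 - p_e),
\]
and node feature shuffling permutes the rows of the feature matrix while keeping the adjacency matrix unchanged,
\[
\tilde{\mathbf X} = \mathbf P \mathbf X, \qquad \mathbf P \in \{0, 1\}^{n \times n} \text{ a random permutation matrix}.
\]
Here $\odot$ denotes the element-wise product and $\mathbf 1_n$ the all-ones vector of length $n$. For undirected graphs, $\mathbf M$ would additionally be constrained to be symmetric. Each technique thus introduces at least one hyperparameter or random choice that has to be tuned per dataset.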

However, corruption techniques for vision and graphs differ fundamentally: corruptions of natural images preserve their underlying semantics, whereas the properties of a graph may change significantly after minor corruptions. For example, in a social network or citation graph, the semantics of a node can change substantially if its edge to a hub node is dropped. Similarly, in molecular graphs, perturbations to the nodes and edges can drastically change the molecule's properties~\citep{sun2021mocl}. Through extensive experiments, \cite{you2020graph} demonstrate that edge perturbations in graph SSL significantly degrade model performance on molecular graphs. It is unclear which corruption techniques are applicable to which graphs, and finding a suitable graph corruption requires substantial trial and error, since many graphs are highly sensitive to both the corruption technique and its parameters. As a result, previous works~\citep{you2021graph, thakoor2021large, ccassg} resort to extensive grid search for the best combination of corruption schemes, and show that different datasets require vastly different corruption parameters, since the performance of the learned models differs greatly under slight changes to the corruption schemes and parameters.

% Some graphs are highly sensitive to corruption functions, since their meanings change drastically from minor perturbations on node or edge information. For example, protein-protein interaction networks and molecular graphs covalent bond (add details). Different from computer vision, graphs are highly sensitive to corruption functions and their hyperparameters, since the performance of the downstream task vary greatly for different corruption schemes. \cite{you2020graph} demonstrates empirically edge perturbations degrade model performance on molecular graphs. \cite{you2021graph} carries out extensive grid search for combinations of corruption schemes and hyperparameters and shows that the performance on downstream task differ greatly for different corruption schemes. Some graph SSL methods rely on trial-and-error to find the combinations of graph corruptions. Moreover, corruptions do not preserve semantics of some graphs. All these evidences imply the over-reliance on corruption functions in graphs is not ideal. 

% \Zhaozhuo{Problems!}

\subsection{Multi-view Representation Learning on Graphs}

In addition to over-relying on corruption techniques, prior graph SSL approaches compute multiple views of the same graph, which incurs significant memory and computational overhead. Previous methods require multiple views because they mine positive and negative examples from them. For example, DGI~\citep{velickovic2019deep} computes an additional view by shuffling node features to produce negative examples, LaGraph~\citep{lagraph} computes two views and minimizes the distance between them, and BGRL~\citep{thakoor2021large} computes four views for positive-only contrastive learning. This raises significant concerns about computational efficiency and scalability. Modern hardware used for GNN training, such as GPUs, has limited memory, and hence computing multiple views scales poorly to large graphs. Compared to supervised training, which computes only a single view of the graph, prior self-supervised methods consume several times more memory and computation time. This is problematic in many real-world settings, since common citation, co-purchasing, and social network graphs contain millions, if not billions, of nodes and edges~\citep{hu2020open}. Although sub-sampling techniques can fit multiple views of the graph into a limited memory budget, they have been shown to hurt performance significantly~\citep{thakoor2021large}. It is therefore desirable to have a graph SSL method that computes only a single view of the graph, so that it can scale to larger graphs efficiently.
% and avoid losing performance to sub-sampling.
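
As a rough back-of-the-envelope illustration of this overhead, consider full-batch training of an $L$-layer GNN with hidden width $h$ in single precision, and suppose that the activations of all $v$ views are retained for backpropagation. The quantities $v$, $L$, and $h$, as well as the numbers below, are assumptions made here for illustration and not measurements. Under these assumptions, the activation memory grows approximately linearly in the number of views:
\[
M(v) \approx 4 \, v \, L \, n \, h \ \mathrm{bytes}.
\]
For instance, with $n = 10^6$ nodes, $L = 2$, and $h = 512$, a single view requires roughly $4 \times 2 \times 10^6 \times 512 \approx 4.1$\,GB, whereas two views require roughly $8.2$\,GB, before accounting for model parameters, optimizer state, or intermediate message-passing buffers. Views that are computed without gradient tracking would lower this estimate, so it should be read as indicative of the scaling trend rather than as an exact figure.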