\section{Introduction}

Graph neural networks (GNNs) have shown great potential in a variety of fields, including social networks~\citep{fan2019graph}, recommendation models~\citep{wu2020graph}, and drug discovery and development~\citep{chen2018rise}. GNNs learn high-quality representations of nodes, edges, or graphs by aggregating node features, edge features, and adjacency information. Traditional methods of training GNNs are costly, since they require a large number of labels to achieve high accuracy on downstream tasks. In real-world scenarios such as recommendation models and social networks, graphs are very large and data labeling is often prohibitively expensive. Furthermore, it is impractical or impossible to collect labels in fields such as biochemistry, since it takes up to two weeks to obtain labels for generated molecules using current simulation tools, and the costs of laboratory experiments are high~\citep{xiong2019pushing}. This evidence suggests that effective methods for learning on graphs without relying on labels are of great significance~\citep{velivckovic2017graph}.

% Grash neural networks (GNNs) have been widely implemented a bunch of fields such as social networks, molecules, and geographics~\citep{chanussot2021open, derrow2021eta, wieder2020compact}. To train a GNN with high prediction accuracy, it requires to learn effective representations. Traditionally, the representations are trained in a supervised way, which costs a large number of node labels. However, due to the high labeling cost, it is almost impossible to collect enough labels satisfying the requirement of supervised training. For example, when using GNNs to assist the drug design~\citep{xiong2019pushing}, it usually takes one to two weeks to evaluate property of generated molecules using the current simulation tools, not to mention the cost of the laboratory experiments. 

% GNNs typically learn graph representations in a supervised or
% semi-supervised setting. In practice, obtaining a large number of labels is often difficult or even impossible, especially in specific areas that are very costly, such as in biochemistry. The labeled graphs may be limited, while unlabeled graphs are easy to collect. Self-supervised learning utilizing unlabeled data has made significant progress in computer vision and shows
% great potential in exploring unlabeled data to enhance graph deep learning.

% \textbf{Summarize the two issues of graph self-supervise learning}

% Though current graph self-supervised learning algorithms have achieved a great success in the learning node representations, they still faces two significant constrains. First, most of the graph self-supervised learning algorithms leverage the

Self-supervised learning (SSL) has shown promise in eliminating the need for labels in graph problems. Prior methods such as DGI~\citep{velickovic2019deep}, GRACE~\citep{grace}, and BGRL~\citep{thakoor2022largescale} contrast two or more corrupted views of the graph to learn useful representations, and have proven effective on some datasets. However, prior SSL approaches for graphs largely suffer from two problems: an over-reliance on unnatural and sometimes unreliable graph corruptions, and the memory and computational overhead of computing multiple graph views.

Prior competitive self-supervised learning approaches for graphs rely on corruption, that is, perturbations of the node attributes or the adjacency matrix. Popular graph corruption techniques include node attribute masking, node attribute perturbation, node dropping, edge dropping, and subgraphing. These methods are directly inspired by data augmentation techniques from computer vision, such as random erasing, color jitter, and cropping. However, augmenting data through corruption differs fundamentally between the two domains: effective corruption functions in computer vision usually preserve the underlying semantics of an image, while it is unclear whether graph corruption functions preserve the semantics of the original graph. Some graphs are highly sensitive to corruption, since their meaning changes drastically under minor perturbations of node or edge information. Examples include protein-protein interaction networks and molecular graphs, where dropping a single edge removes a covalent bond and thus changes the molecule being represented.
% TODO: add details
Downstream performance is also highly sensitive to the choice of corruption functions and their hyperparameters. \citet{you2020graph} demonstrate empirically that edge perturbations degrade model performance on molecular graphs. \citet{you2021graph} carry out an extensive grid search over combinations of corruption schemes and hyperparameters and show that downstream performance differs greatly across corruption schemes. As a result, some graph SSL methods rely on trial and error to find suitable combinations of graph corruptions. Taken together, this evidence implies that the over-reliance on corruption functions in graph SSL is not ideal.

In addition to over-relying on corruption functions, prior competitive self-supervised graph learning approaches require computing multiple views of the same graph, from which they mine positive and negative examples for contrastive learning. Modern hardware used for GNN training, such as GPUs, has a limited amount of memory, and computing multiple views scales poorly to large graphs. Compared to supervised training, which computes only a single view of the graph, prior self-supervised methods consume several times more memory and computation time, which poses a scalability problem. This is problematic in many real-world settings, since common citation, co-purchasing, and social network graphs contain millions of nodes and edges. Although sub-sampling techniques exist to fit multiple views of the graph within a limited memory budget, they have been demonstrated to hurt performance significantly~\citep{thakoor2022largescale}. Ideally, self-supervised graph learning would compute only a single view of the graph, in order to scale to larger graphs and avoid losing performance to additional sub-sampling.

% \paragraph{Unnatural Corruption}


% \paragraph{High Computation Overhead}


% \paragraph{Our contribution}
% \begin{itemize}
%     \item corruption free
%     \item easy achieve enough positive/negative node pairs
%     \item scalable to large graph training
% \end{itemize}

% \textbf{Describe our motivation} leverage diffusion to address the issue of false positive node pairs (need to define false positive node pairs). diffusion could mitigate the hetereophily of the original (normalized) adjacency matrix. Provide evidence...
% place the figure here...

% \textbf{Highlight our experiment results} Show Number here 1) state-of-the-art performance 2) better performance than supervised learning 3) scalable to large graph training.

% The significant flaws and drawbacks of the current approaches call for a corruption-free single-view approach for graph SSL. The naive method 


Instead of constructing alternative views of the same graph to mine positive and negative examples for contrastive learning, we can treat nodes within the same graph as positive or negative examples of each other. A naive target for self-supervised graph learning is the adjacency matrix; in other words, we can treat neighboring nodes as positive examples and non-neighboring nodes as negative examples. However, this learning target is noisy and ineffective, since the data collection process of real-world graphs is highly noisy.
% TODO: give example
Moreover, we argue that it is unnatural to treat node pairs as hard positive or negative examples; depending on node and structural properties, two nodes that are far apart may be semantically similar, while two neighboring nodes may be polar opposites.

We propose to leverage node proximity measures as the learning target for single-view graph self-supervised learning. Proximity measures on graphs have been widely used in the natural sciences to measure the spread of matter under random Brownian motion, and in search engines to measure the importance of web pages. Proximity scores measure the direct and indirect connection strength between node pairs, and thus capture higher-order connectivity information than the adjacency matrix.

Unlike the adjacency matrix, which is binary (1 for neighbors and 0 for non-neighbors), the diffusion matrix is row-stochastic: each entry is a non-negative real value and each row sums to 1. Therefore, by using diffusion as the target, we treat node pairs as soft positive/negative examples, which is a novel paradigm for graph self-supervised learning. Moreover, diffusion has been shown to act as a denoising filter on graph structure, similar to a low-pass filter in computer vision~\citep{gasteiger_diffusion_2019}. We propose a generalized version of graph homophily for measuring the quality of a learning target, and use it to quantify the advantage of diffusion over the adjacency matrix.
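
As a concrete illustration of the row-stochastic property, consider personalized PageRank diffusion. This is a sketch under assumed notation, not a definition fixed by the text above: $\alpha \in (0,1)$ denotes a teleport probability, $\mathbf{A}$ the adjacency matrix, $\mathbf{D}$ the diagonal degree matrix, $\mathbf{T} = \mathbf{D}^{-1}\mathbf{A}$ the random-walk transition matrix, and $\mathbf{S}$ the resulting diffusion matrix. We further assume that every node has at least one edge, so that $\mathbf{D}$ is invertible.
\[
  \mathbf{S} \;=\; \alpha \sum_{k=0}^{\infty} (1-\alpha)^{k}\, \mathbf{T}^{k}
  \;=\; \alpha \bigl(\mathbf{I} - (1-\alpha)\,\mathbf{T}\bigr)^{-1}.
\]
Since $\mathbf{T}$ is row-stochastic, $\mathbf{T}^{k}\mathbf{1} = \mathbf{1}$ for every $k \geq 0$, and therefore
\[
  \mathbf{S}\,\mathbf{1} \;=\; \alpha \sum_{k=0}^{\infty} (1-\alpha)^{k}\, \mathbf{1} \;=\; \mathbf{1}.
\]
Every entry of $\mathbf{S}$ is a non-negative combination of entries of the powers $\mathbf{T}^{k}$, so under these assumptions $\mathbf{S}$ is non-negative and each of its rows sums to 1.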

In this paper, we address the two problems mentioned above by introducing a self-supervised graph learning approach that does not rely on corruption functions, for better representation learning, and computes only a single graph view, for better scalability and performance.

Prior methods use graph perturbations to produce alternative views of the same graph for contrastive learning. However, unlike for images, these kinds of augmentations are not natural for graph data.
% TODO: give examples
It is therefore desirable to perform self-supervised learning on graphs without corruptions.

Another problem of prior competitive self-supervised learning approaches is that they compute multiple views of the graph. Modern hardware used for GNN training, such as GPUs, has a limited amount of memory, and multiple views of the same graph consume far more memory than supervised training, in which only one view of the graph is computed. It has been shown in [] that sub-sampling the original graph decreases accuracy, so it is preferable to compute only a single representation of the graph.

To this end, we propose a novel approach for graph self-supervised learning that is corruption-free and requires computing only a single view of the graph. Furthermore, for graphs that are too large to fit into GPU memory, we leverage recent advances in sub-sampling for GNN training to derive a principled approach for self-supervised training on large graphs. We demonstrate the effectiveness of our approach by evaluating it on popular graph datasets.

Prior competitive self-supervised learning methods for graphs largely follow the same recipe: generate multiple views of the graph via corruption, and perform contrastive learning on positive and negative instances drawn from the different views. DGI contrasts a global view and a local view of the graph, using a corruption function such as node shuffling. GraphCL uses node dropping, edge perturbation, attribute masking, and subgraphing. GRACE uses edge dropping and node masking to generate corrupted graph views, treating the same node across views as a positive instance and different nodes, both within a view and across views, as negative instances. GraphMAE uses a reconstruction objective: it corrupts the graph through masking and uses a GNN encoder and a GNN decoder to reconstruct the masked features.

It is not clear whether a corrupted view of the graph should serve as a positive or a negative example for learning. We propose an alternative perspective on self-supervised learning for graphs: treating all nodes as soft positive/negative instances for a given node. We leverage the personalized PageRank (PPR) matrix as a measure of relative importance, and argue that it is an excellent soft label for contrastive learning.
