\section{{Proximity Divergence Minimization}}
\label{sec:cursive}
In this section, we first motivate our proposed method by describing a strawman approach and identifying its flaws in Section~\ref{sec:motivation}. We then describe our proposed method, explaining the learning target in Section~\ref{sec:proximity} and the training strategy in Section~\ref{sec:training}. Next, we analyze the memory efficiency of our approach and compare it against existing methods in Section~\ref{sec:mem}. Finally, we describe an extension of our method for scaling to very large graphs in Section~\ref{sec:cluterGCN}.




\subsection{Our Motivation: A Corruption-free Single-view SSL on Graphs}
\label{sec:motivation}

The drawbacks of corruption techniques and multi-view approaches call for a corruption-free single-view SSL approach for graphs. However, this is nontrivial in practice.

\input{Sample UAI 2023 paper/figs/fig2.tex}
\subsubsection{A Strawman Approach}

A strawman approach for corruption-free single-view SSL is to perform contrastive learning using neighboring nodes as positive instances and non-neighboring nodes as negative instances. This method has two major problems: it sees only the local structure and fails to take the graph structure at a macro level into account, and it is sensitive to noisy edges in real-world datasets.

In this strawman approach, neighboring nodes are treated as positive examples and their mutual information is maximized. This is counterproductive because it ignores the rich information implied by the graph structure at a macro level, where the strength of connections between nodes varies greatly. For example, a graph may consist of a few densely connected node clusters with sparse edges across clusters, a common structure in real-world graphs such as citation and social networks~\citep{fan2019graph}. At a macro level, this structure implies that nodes are more strongly linked to nodes within the same cluster and more weakly linked to nodes in other clusters. The strawman approach does not capture this information, because it treats every edge as equally strong. This could mislead the feature learning target and discourage the encoder from understanding the global graph structure. As a result, the learned node representations are entangled, as shown in Figure~\ref{fig:diffusion}. Furthermore, real-world graphs often include a large number of noisy edges~\citep{kang2019robust}, which confuse the encoder and result in poorer representations.

\subsection{Our Proposal}

We propose Proximity Divergence Minimization (\method), which views node proximity as a distribution and uses it as the learning objective for similarity between node representations. Our proposed method resolves the two aforementioned problems and incorporates the knowledge of graph structure into the learned representations.

\subsubsection{Node Proximity Distribution}
\label{sec:proximity}
A node proximity score $P_u(v)$ measures the strength of direct and indirect connections between a pair of nodes $u$ and $v$ in a graph~\citep{proximity}. Unlike classical distance metrics between nodes such as the shortest-path distance, proximity measures take all connections into account and thus capture rich structural information about the relationship between a node pair~\citep{prox_measures}. Leveraging node proximity measures overcomes the two aforementioned problems. Because proximity measures take all connections into account, they smooth out the noisy edges in real-world graphs and reflect the structure of the graph at a macro level. Moreover, unlike binary adjacency information, they capture the differences in connection strength between node pairs.

We consider only proximity measures that are normalized, i.e., $\sum_{v=1}^n P_u(v) = 1$. We view the proximity scores of a given node $u$ as a distribution over all nodes in the graph, and we use these proximity distributions as the learning target in our proposed method. We consider the following three types of node proximity measures in this work.

\paragraph{Heat Kernel} The heat kernel is a technique commonly applied in the natural sciences to measure the distribution of heat or diffusive matter~\citep{heat_physics}; \citet{heat_chung} generalized it to discrete graph structures. The heat kernel matrix is a convergent, infinite sum of weighted $i$-hop adjacency matrices,
\begin{equation}
\mathbf P_\text{heat} = \sum_{i=0}^\infty \alpha_i^\text{heat} \hat{\mathbf A}^i    
\end{equation}
where $\hat{\mathbf A} = (\mathbf D + \mathbf I)^{-\frac 12} (\mathbf A+\mathbf I) (\mathbf D+\mathbf I)^{-\frac 12}$ is the symmetrically normalized adjacency matrix with self-loops, $\mathbf A \in \mathbb R^{n \times n}$ is the adjacency matrix, $\mathbf D \in \mathbb R^{n \times n}$ is the diagonal degree matrix, and $\alpha_i^\text{heat} = \frac{e^{-t}t^i}{i!}$ for diffusion time $t$~\citep{gasteiger_diffusion_2019}. The node proximity score between nodes $u$ and $v$ based on the heat kernel is the $(u, v)$ entry of the heat kernel matrix, i.e., $P_u(v) = \left[\mathbf P_\text{heat} \right]_{u, v}$.
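As a brief worked step (standard matrix-exponential algebra rather than an additional result of this work), the heat kernel weights are Poisson probabilities with mean $t$, so the series can be summed in closed form:
\begin{equation}
\begin{aligned}
\mathbf P_\text{heat} &= e^{-t} \sum_{i=0}^\infty \frac{(t \hat{\mathbf A})^i}{i!} = e^{-t} \exp(t \hat{\mathbf A}) \\
&= \exp\bigl(-t (\mathbf I - \hat{\mathbf A})\bigr).
\end{aligned}
\end{equation}
Alternatively, the sum can be truncated after the terms $i = 0, \dots, K$, where $K$ is an illustrative cutoff that is not prescribed here; since the weights $\alpha_i^\text{heat}$ sum to one, the discarded coefficient mass is $1 - \sum_{i=0}^{K} e^{-t} t^i / i!$.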

\paragraph{Personalized PageRank} Personalized PageRank (PPR) was originally proposed for use in search engines to measure the personalized importance of a web page~\citep{page1999pagerank}. It can be interpreted as a probability matrix in which entry $(u, v)$ is the probability that a random walk starting at node $u$ eventually terminates at node $v$. The PPR matrix is defined as
\begin{equation}
\mathbf P_\text{PPR} = \sum_{i=0}^\infty \alpha_i^\text{PPR} \hat{\mathbf A}^i
\end{equation}
where $\alpha_i^\text{PPR} = \beta (1-\beta)^i$ for the teleport probability $\beta \in (0, 1)$~\citep{gasteiger_diffusion_2019}. The node proximity score between nodes $u$ and $v$ based on PPR is the $(u, v)$ entry of the PPR matrix, i.e., $P_u(v) = \left[\mathbf P_\text{PPR} \right]_{u, v}$.
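A short worked step (a standard geometric-series identity, stated here for illustration and assuming an undirected graph) gives a closed form for this sum. The eigenvalues of the symmetric matrix $\hat{\mathbf A}$ lie in $[-1, 1]$, so for $\beta \in (0, 1)$ the spectral radius of $(1-\beta)\hat{\mathbf A}$ is at most $1 - \beta < 1$ and the series converges:
\begin{equation}
\begin{aligned}
\mathbf P_\text{PPR} &= \beta \sum_{i=0}^\infty \bigl((1-\beta) \hat{\mathbf A}\bigr)^i \\
&= \beta \bigl(\mathbf I - (1-\beta) \hat{\mathbf A}\bigr)^{-1}.
\end{aligned}
\end{equation}
If the sum is instead truncated after the terms $i = 0, \dots, K$, with $K$ an illustrative cutoff, the discarded coefficient mass is $\sum_{i > K} \beta (1-\beta)^i = (1-\beta)^{K+1}$, which decays geometrically in $K$.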

\input{Sample UAI 2023 paper/tables/ram.tex}

\paragraph{SimRank} SimRank~\citep{simrank} is a proximity measure that determines the similarity between nodes based on their structural contexts, and it can be combined with domain-specific similarity measures to be made more informative. It is motivated by the insight that two nodes are similar if they are pointed to by similar nodes. The SimRank score between nodes $u$ and $v$ is defined recursively as
\begin{multline}
    \mathbf P_\text{SimRank}(u, v) = \\\frac{C}{|\mathcal N_\text{in}(u)||\mathcal N_\text{in}(v)|}\sum_{a \in \mathcal N_\text{in}(u)}\sum_{b \in \mathcal N_\text{in}(v)}\mathbf P_\text{SimRank}(a, b)
\end{multline}
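where $\mathcal N_\text{in}(u)$ denotes the set of in-neighbors of $u$ and $C \in (0, 1)$ is a constant. This recursion applies to pairs $u \neq v$; by convention, $\mathbf P_\text{SimRank}(u, u) = 1$, and the score is $0$ when either in-neighbor set is empty. In practice, the SimRank score is computed iteratively, either until convergence or for a fixed number of iterations, by applying the following update to every pair $u \neq v$ while keeping the diagonal entries at $1$: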
\begin{multline}
    \mathbf P_\text{SimRank}^{i+1}(u, v) = \\\frac{C}{|\mathcal N_\text{in}(u)||\mathcal N_\text{in}(v)|}\sum_{a \in \mathcal N_\text{in}(u)}\sum_{b \in \mathcal N_\text{in}(v)}\mathbf P_\text{SimRank}^i(a, b)
\end{multline}
with initialization
$$
    \mathbf P_\text{SimRank}^0(u, v)=
    \begin{cases}
        0 & \text{if } u \neq v \\
        1 & \text{if } u = v
    \end{cases}
$$
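To make the recursion concrete, consider a hypothetical three-node directed graph (introduced only for illustration, not taken from our experiments) with edges $a \to u$ and $a \to v$, so that $\mathcal N_\text{in}(u) = \mathcal N_\text{in}(v) = \{a\}$ and $\mathcal N_\text{in}(a) = \emptyset$. We assume the standard SimRank conventions that $\mathbf P_\text{SimRank}^i(w, w) = 1$ for every node $w$ and every $i$, and that the score is $0$ when either in-neighbor set is empty. One iteration from the initialization above gives
\begin{equation}
\mathbf P_\text{SimRank}^{1}(u, v) = \frac{C}{1 \cdot 1}\, \mathbf P_\text{SimRank}^{0}(a, a) = C,
\end{equation}
and further iterations leave this value unchanged because $\mathbf P_\text{SimRank}^{i}(a, a)$ stays at $1$. In contrast, $\mathbf P_\text{SimRank}^{i}(u, a) = \mathbf P_\text{SimRank}^{i}(v, a) = 0$ for all $i$, since $a$ has no in-neighbors.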

% We adopt the definition of generalized graph diffusion proposed by \cite{gasteiger_diffusion_2019}, which is defined as a convergent, infinite sum of weighted $i$-hop normalized adjacency matrices with self-edges, $\mathbf T=\sum_{i=0}^\infty \alpha_i \hat{\mathbf A}^i$. We use the symmetric normalization scheme proposed in \cite{welling2016semi} for the adjacency matrix with self-edges, i.e., $\hat{\mathbf A} = (\mathbf D + \mathbf I)^{-\frac 12} (\mathbf A+\mathbf I) (\mathbf D+\mathbf I)^{-\frac 12}$, where $\mathbf D$ is the diagonal degree matrix. We consider two types of graph diffusion: heat diffusion~\citep{heat_chung} and personalized PageRank (PPR)~\cite{page1999pagerank}. For the weighting coefficients, heat diffusion uses an exponential series $\alpha_i^\text{heat} = \frac{e^{-t}t^i}{i!}$, where $t$ is the diffusion time~\citep{heat_chung}, and PPR diffusion uses a geometric series $\alpha_i^\text{PPR} = \beta (1-\beta)^i$, where $\beta$ is the teleport probability~\citep{page1999pagerank}. PPR diffusion is interpreted as a probability matrix in which the entry $u, v$ is the probability of a random walk starting at node $u$ with terminating probability $\beta$ at each step eventually terminating at node $v$, and it is used in search engines to measure the personalized importance of a web page~\citep{page1999pagerank}. Both forms of diffusion have closed-form solutions, or they can be approximated by computing the sum of only the first few terms in $\sum_{i=0}^\infty \alpha_i \hat{\mathbf A}^i$.

\subsubsection{Training Strategy}
\label{sec:training}

The main idea of \method is to minimize the divergence between the distribution of node representation similarity and the distribution of node proximity.

\paragraph{Node Representation Similarity} To measure the similarity between the learned representations of two nodes $u$ and $v$, we use the dot product $\text{sim}(\mathbf z_u, \mathbf z_v) = \mathbf z_{u}^\top\mathbf z_{v}$. We apply softmax normalization to turn the similarity scores between node $u$ and all nodes in the graph into a distribution. Specifically, the representation similarity distribution of node $u$ is
\begin{equation}
S_u(v) = \frac {\exp(\text{sim}(\mathbf z_u, \mathbf z_v))}{\sum_{i=1}^n \exp(\text{sim}(\mathbf z_u, \mathbf z_i))}, \quad v \in [n].
\end{equation}

\newcommand\scalemath[2]{\scalebox{#1}{\mbox{\ensuremath{\displaystyle #2}}}}

\paragraph{Loss Function} Our proposed loss function is the mean Kullback--Leibler divergence~\citep{kld} between the node proximity distribution and the node representation similarity distribution,
\begin{multline}
    \frac 1n \sum_{u=1}^n D_{\text{KL}}(P_u \parallel S_u) = \\\frac 1n \sum_{u=1}^n \Bigl(\sum_{v=1}^n P_u(v)\log P_u(v) - \sum_{v=1}^n P_u(v)\log S_u(v)\Bigr).
\end{multline}
Since the node proximity distribution of a given graph is fixed, its negative entropy $\sum_{v=1}^n P_u(v)\log P_u(v)$ is a constant that does not depend on the encoder. We therefore omit it and equivalently minimize the following cross-entropy loss:
\begin{equation}
\mathcal L = -\frac 1n \sum_{u=1}^n \sum_{v=1}^n P_u(v) \log \frac {\exp(\text{sim}(\mathbf z_u, \mathbf z_v))}{\sum_{i=1}^n \exp(\text{sim}(\mathbf z_u, \mathbf z_i))}
\end{equation}
An overview of \method's training strategy is shown in Figure~\ref{fig:main}.
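The following short derivation illustrates why minimizing $\mathcal L$ pulls $S_u$ toward $P_u$. It treats the logits $s_{uv} = \text{sim}(\mathbf z_u, \mathbf z_v)$ of each row as free variables, which is a simplifying assumption: in practice they are coupled through the encoder and satisfy $s_{uv} = s_{vu}$. Using $\partial \log S_u(v') / \partial s_{uv} = \delta_{vv'} - S_u(v)$, where $\delta_{vv'}$ is the Kronecker delta, together with $\sum_{v'=1}^n P_u(v') = 1$, we obtain
\begin{equation}
\begin{aligned}
\frac{\partial \mathcal L}{\partial s_{uv}} &= -\frac 1n \Bigl( P_u(v) - S_u(v) \sum_{v'=1}^n P_u(v') \Bigr) \\
&= \frac 1n \bigl( S_u(v) - P_u(v) \bigr),
\end{aligned}
\end{equation}
so the gradient with respect to the logits vanishes exactly when $S_u(v) = P_u(v)$ for all $v$.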

In practice, using the proximity distributions of only a subsample of nodes at each training step accelerates convergence and avoids loading all proximity scores into GPU memory. Therefore, for a batch $B$ of sampled node indices, we minimize the following batched loss function:
\begin{equation}
\mathcal L_{\text{batch}} = -\frac 1{|B|} \sum_{u \in B} \sum_{v=1}^n P_u(v) \log \frac {\exp(\text{sim}(\mathbf z_u, \mathbf z_v))}{\sum_{i=1}^n \exp(\text{sim}(\mathbf z_u, \mathbf z_i))}
\end{equation}
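A short calculation supports this subsampling, under the illustrative assumption that $B$ is drawn uniformly at random with a fixed size $|B|$, so that each node is included with probability $|B|/n$. Writing $\ell_u = -\sum_{v=1}^n P_u(v) \log S_u(v)$ for the per-node loss,
\begin{equation}
\begin{aligned}
\mathbb E[\mathcal L_\text{batch}] &= \frac{1}{|B|} \sum_{u=1}^n \Pr(u \in B)\, \ell_u \\
&= \frac 1n \sum_{u=1}^n \ell_u = \mathcal L,
\end{aligned}
\end{equation}
so the batched loss is an unbiased estimate of the full loss for fixed encoder parameters. Note that only the proximity rows are subsampled: the softmax denominator still ranges over all $n$ nodes, so each step touches $|B|$ rows of proximity scores, i.e., $O(|B| n)$ entries rather than $O(n^2)$.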

Empirically, we found that batches of $1024$ or $2048$ node proximity distributions work well.
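As a rough back-of-the-envelope illustration, assume that the proximity scores are stored densely as 32-bit floats (an assumption made only for this arithmetic; a sparsified or lower-precision representation would be smaller). For ogbn-arxiv, which has $n = 169{,}343$ nodes, the full proximity matrix and a batch of $|B| = 1024$ rows require
\begin{equation}
\begin{aligned}
4 n^2 &\approx 1.15 \times 10^{11} \text{ bytes} \approx 115\,\text{GB}, \\
4 |B| n &\approx 6.9 \times 10^{8} \text{ bytes} \approx 0.69\,\text{GB},
\end{aligned}
\end{equation}
respectively, and a batch of $2048$ rows requires about $1.39$\,GB.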


% To avoid loading all the proximity scores into GPU memory, we can sample a batch of proximity distributions. 

% In practice, mini-batch gradient descent help convergence and reduce memory usage~\citep{graddesc}. Therefore, in each training step, we sample a batch of node indices $B = \{b_1, \cdots, b_{|B|}\}$ from $\{1, \cdots, n\}$ without replacements, and minimize the following batched loss function.
% \[
% \mathcal L_{\text{batch}} = -\frac 1{|B|} \sum_{u \in B} \sum_{v=1}^n \mathbf T_{u, v} \log \frac {\exp(\text{sim}(\mathbf z_u, \mathbf z_v))}{\sum_{i=1}^n \exp(\text{sim}(\mathbf z_u, \mathbf z_i))}
% \]
% With batched execution, only $|B|$ rows of the diffusion matrix needs to be loaded into GPU memory, which reduces the memory complexity of the loss computation from $O(n^2)$  to $O(|B|n)$. Since the batched loss function and the original loss function are equal in expectation, i.e. $E(\mathcal L_{\text{batch}}) = \mathcal L$, the quality of the learned model does not degrade with batched execution. Pseudo-code for \method is given in Algorithm~\ref{algo1}.

\paragraph{Intuition} Node proximity measures are higher-order connectivity scores for node pairs. They are a more informed measure of the structural relationship between nodes than adjacency information: they smooth out noisy connections and boost the signal-to-noise ratio of the graph spectrum. Minimizing the divergence between the proximity distribution and the representation similarity distribution encourages the encoder to incorporate structural knowledge into the learned node representations. Instead of using hard positive/negative instances, we use proximity measures between nodes as supervision, tuning the relationship between learned node representations according to the connection strength implied by the graph structure at a macro level.

\paragraph{Limitations} \method is based on the assumption of homophily~\citep{homophily}, meaning that neighboring nodes tend to be similar. Therefore, \method may not perform well on non-homophilous graphs. Additionally, choosing a good proximity measure for a particular graph may require trial and error, as theoretical guidance is lacking. Proximity measures may also be expensive to compute.


% Information should be spread from one node to other nodes across the graph in a natural process similar to how heat diffuses through a medium. With the diffusion matrix, we are able to measure the amount of information a source node should spread to other nodes. Since the diffusion matrix is symmetric, the amount of information spread from node $u$ to node $v$ is equal to the amount spread from $v$ to $u$. We can use the similarity of two nodes' representations to measure the amount of their shared information. Therefore, if we enforce the similarity between nodes to follow the diffusion matrix, we can encourage the representation of each node to capture the ideal amount of information from nodes in the neighborhood around it.

% Instead of maximizing mutual information between positive examples, we view the normalized similarity measure between a node and all other nodes as a distribution, and consider each row of the row-stochastic diffusion matrix as the target distribution, and minimize the divergence between the two distributions. This way, we do not use explicit positive and negative examples for learning, but we tune the relatedness of the learned node representations using the connection strength implied by the graph structure at a macro level.

% \paragraph{Intuition} In a graph, information in the neighbors around a node is important for the node's learned representation. Information should be spread from one node to other nodes across the graph in a natural process similar to how heat diffuses through a medium. With the diffusion matrix, we are able to measure the amount of information a source node should spread to other nodes. Since the diffusion matrix is symmetric, the amount of information spread from node $u$ to node $v$ is equal to the amount spread from $v$ to $u$. We can use the similarity of two nodes' representations to measure the amount of their shared information. Therefore, if we enforce the similarity between nodes to follow the diffusion matrix, we can encourage the representation of each node to capture the ideal amount of information from nodes in the neighborhood around it.

% We propose to use the graph diffusion matrix as the learning target to resolve the two mentioned problems and incorporation the knowledge of graph structure into learned representations. Diffusion measures the flow of matter or information from one node to another through the global graph structure. It has direct connections to natural sciences~\citep{heat_physics}. For example, heat graph diffusion generalizes the kernel used for measuring the flow and distribution of heat to discrete graph structure~\citep{heat_chung}. The continuous distribution of matter or information measured by the diffusion matrix can be seen as a measure of connection strength between nodes. This idea has been leveraged in web page rankings by using personalized PageRank as a diffusion matrix to measure the relative importance of web pages~\citep{page1999pagerank}. Thus, diffusion takes into account the global structure of the graph by emphasizing important connections while weakening unimportant ones. Moreover, the diffusion matrix has been shown to act as a denoising filter for graph structure to smooths out noisy edges~\citep{gasteiger_diffusion_2019}. The diffusion matrix significantly boosts the signal-to-noise ratio of a graph.

% \subsection{Graph Diffusion Matrix as the Target}
% \label{sec:target}

% We adopt the definition of generalized graph diffusion proposed by \cite{gasteiger_diffusion_2019}, which is defined as a convergent, infinite sum of weighted $i$-hop normalized adjacency matrices with self-edges, $\mathbf T=\sum_{i=0}^\infty \alpha_i \hat{\mathbf A}^i$. We use the symmetric normalization scheme proposed in \cite{welling2016semi} for the adjacency matrix with self-edges, i.e., $\hat{\mathbf A} = (\mathbf D + \mathbf I)^{-\frac 12} (\mathbf A+\mathbf I) (\mathbf D+\mathbf I)^{-\frac 12}$, where $\mathbf D$ is the diagonal degree matrix. We consider two types of graph diffusion: heat diffusion~\citep{heat_chung} and personalized PageRank (PPR)~\cite{page1999pagerank}. For the weighting coefficients, heat diffusion uses an exponential series $\alpha_i^\text{heat} = \frac{e^{-t}t^i}{i!}$, where $t$ is the diffusion time~\citep{heat_chung}, and PPR diffusion uses a geometric series $\alpha_i^\text{PPR} = \beta (1-\beta)^i$, where $\beta$ is the teleport probability~\citep{page1999pagerank}. Heat diffusion is a concept commonly applied in natural sciences to measure the distribution of heat or diffusive matter~\citep{heat_physics}, and \cite{heat_chung} generalized the heat kernel to discrete graph structures. PPR diffusion is interpreted as a probability matrix in which the entry $u, v$ is the probability of a random walk starting at node $u$ with terminating probability $\beta$ at each step eventually terminating at node $v$, and it is used in search engines to measure the personalized importance of a web page~\citep{page1999pagerank}. Both forms of diffusion have closed-form solutions, or they can be approximated by computing the sum of only the first few terms in $\sum_{i=0}^\infty \alpha_i \hat{\mathbf A}^i$.

% Following the definition above, the diffusion matrix $\mathbf T$ is a row-stochastic matrix, and hence each row of $\mathbf T$ can be viewed as a distribution. Specifically, we define the diffusion distribution of node $u$ as $\mathbb T_u(v) = \mathbf T_{u, v}, v \in [n]$, where $\mathbf T_{u, v}$ is the value in the $u$th row and $v$th column of the diffusion matrix. We will use the diffusion distribution of nodes as the target of our learning objective. 



% \subsection{Training Strategy}
% \label{sec:training}

% \paragraph{Loss Function} Our proposed objective minimizes the divergence between the diffusion distribution and the similarity distribution of node $u$, for all $u \in [n]$. Therefore, we construct our loss function as the mean Kullback–Leibler divergence~\citep{kld} between the diffusion distribution and the similarity distribution, $\frac 1n \sum_{u=1}^n D_{\text{KL}}(\mathbb T_u \parallel \mathbb S_u) = \frac 1n \sum_{u=1}^n (\sum_{v=1}^n \mathbb T_u(v)\log \mathbb T_u(v) - \sum_{v=1}^n \mathbb T_u(v)\log \mathbb S_u(v))$. Since the entropy of the diffusion distribution of a given graph is fixed (i.e. $\sum_{v=1}^n \mathbb T_u(v)\log \mathbb T_u(v)$ is a constant value), we omit it and equivalently minimize the following loss function. An overview of \method's training strategy is shown in Figure~\ref{fig:main}.
% \[
% \mathcal L = -\frac 1n \sum_{u=1}^n \sum_{v=1}^n \mathbf T_{u, v} \log \frac {\exp(\text{sim}(\mathbf z_u, \mathbf z_v))}{\sum_{i=1}^n \exp(\text{sim}(\mathbf z_u, \mathbf z_i))}
% \]

% To avoid loading all the proximity scores into GPU memory, we can sample a batch of proximity distributions. 

% In practice, mini-batch gradient descent help convergence and reduce memory usage~\citep{graddesc}. Therefore, in each training step, we sample a batch of node indices $B = \{b_1, \cdots, b_{|B|}\}$ from $\{1, \cdots, n\}$ without replacements, and minimize the following batched loss function.
% \[
% \mathcal L_{\text{batch}} = -\frac 1{|B|} \sum_{u \in B} \sum_{v=1}^n \mathbf T_{u, v} \log \frac {\exp(\text{sim}(\mathbf z_u, \mathbf z_v))}{\sum_{i=1}^n \exp(\text{sim}(\mathbf z_u, \mathbf z_i))}
% \]
% With batched execution, only $|B|$ rows of the diffusion matrix needs to be loaded into GPU memory, which reduces the memory complexity of the loss computation from $O(n^2)$  to $O(|B|n)$. Since the batched loss function and the original loss function are equal in expectation, i.e. $E(\mathcal L_{\text{batch}}) = \mathcal L$, the quality of the learned model does not degrade with batched execution. Empirically, we found batch sizes of $1024$ and $2048$ work well. Pseudo-code for \method is given in Algorithm~\ref{algo1}.

% % An ideal graph encoder will spread information from a node to other nodes across the graph similar to how heat diffuses through a medium. The diffusion matrix essentially measures the amount of information a source node spreads to another node. Since the diffusion matrix is symmetric, the amount of information spread from node $u$ to node $v$ is equal to the amount of information spread from $v$ to $u$. The learned representation of a node is influenced by the information diffused from other nodes. The larger the amount of information diffused between two nodes, the more similar their representations should become, as this property will help the downstream classifiers.

% \begin{algorithm}[H]
% \SetAlgoLined
%     \KwInput{node features $\mathbf X \in \mathbb R^{n \times k}$, adjacency matrix $\mathbf A \in \mathbb R^{n \times n}$, GNN encoder $\mathcal E: \mathbb R^{n \times k} \times \mathbb R^{n \times n} \rightarrow \mathbb R^{n \times d}$, number of epochs $n_\text{epochs}$}
%     \KwOutput{trained GNN encoder $\mathcal E: \mathbb R^{n \times k} \times \mathbb R^{n \times n} \rightarrow \mathbb R^{n \times d}$}
%     Compute the diffusion matrix $\mathbf T$ through the closed-form solution or approximation\\
%     \For{$e=1\cdots n_\text{epochs}$}{
%         Let $N = \{1, \cdots, n\}$ \\
%         \While{$|N| > 0$}{
%             Sample a batch of node indices $B = \{b_1, \cdots, b_{|B|}\}$ from $N$ without replacements\\
%             Let $\mathbf Z$ be the output of the encoder $\mathcal E(\mathbf A, \mathbf X)$\\
%             Compute $\mathcal L = -\frac 1{|B|} \sum_{u \in B} \sum_{v=1}^n \mathbf T_{u, v} \log \frac {\exp(\text{sim}(\mathbf z_u, \mathbf z_v))}{\sum_{i=1}^n \exp(\text{sim}(\mathbf z_u, \mathbf z_i))}$ and perform back-propagation to update the parameters of $\mathcal E$
%         }
%     }
%     return $\mathcal E$
% \caption{Our proposed method CURSIVE for graph SSL}
% \label{algo1}
% \end{algorithm}
% \vspace{-0.2cm}
% \vspace{-0.5cm}
% \textbf{Relative Node Importance} We use the proximity measure Personalized PageRank (PPR) to measure the relative importance of a node with respect to a source node. The Personalized PageRank matrix, $P \in \mathbb R^{n \times n}$, is defined such that $P_{u, v}$ is the probability of an $\alpha$-random walk starting at node $u$ ending at node $v$. It has been used in applications such as web page ranking in search engines to measure the relative significance of a node with respect to other nodes in a graph. This measure of importance is relative since the importance scores are normalized with respect to the source node, i.e. $P_{u, v} \ge 0, \sum_{i=1}^n P_{u, i} = 1, 1 \le u, v \le n$. The PPR matrix has the following closed-form solution,

% $$
% \mathbf P = \alpha(\mathbf I - (1 - \alpha)\mathbf{\Tilde{A}}),
% $$

% where $\mathbf{\Tilde{A}}$ is a normalized adjacency matrix with self-loops. We use the normalization scheme proposed in GCN and use $\mathbf{\Tilde{A}} = (\mathbf D+ \mathbf I)^{-\frac 12}(\mathbf A+ \mathbf I)(\mathbf D+ \mathbf I)^{-\frac 12}$, where $\mathbf D$ is the diagonal degree matrix.

% \textbf{Relative Similarity of Encoded Representations} We use dot product to measure the similarity between two encoded node representations, which is a common measure of similarity in machine learning applications. The relative similarity between two encoded node representations is the softmax-normalized similarity with respect to the source node, i.e., the relative similarity of node $v$'s encoded representation with respect to the source node $u$ is $\frac{\exp{(h_u\cdot h_v)}}{\sum_{i=1}^n \exp{(h_u \cdot h_i)}}$.

% \textbf{Loss Function} Our goal is to maximize the agreement between the relative similarity between encoded representations and the relative importance between nodes. Since the relative importance matrix $P$ and the relative similarity matrix $S$ are both row-stochastic, each row in $\mathbf P$ or $\mathbf S$ can be viewed as a discrete probability distribution. We propose to use the averaged Kullback–Leibler divergence between corresponding rows of $\mathbf P$ and $\mathbf S$ as the loss function, $-\frac 1n\sum_{u=1}^n\sum_{v=1}^n \mathbf P_{u, v}\log \frac{\mathbf S_{u, v}}{P_{u, v}}$. Since the PPR matrix $\mathbf P$ remains constant when parameters of the GNN-based encoder are updated through gradient descent, we will equivalently use the averaged cross entropy between corresponding rows of $\mathbf P$ and $\mathbf S$ as the loss function,

% $$
% \begin{aligned}
% \mathcal L &= -\frac 1n\sum_{u=1}^n\sum_{v=1}^n \mathbf P_{u, v}\log \text{sim}
% \end{aligned}
% $$

% Lightweight and symmetrical similarity measure dot product
% Diffusion matrix is has dimension $n\times n$, which might not fit into GPU RAM. Sample $b$ rows of the diffusion matrix, similar to mini-batch gradient descent.
% Divide into subgraphs and compute diffusion matrix 

% \iffalse
% \textbf{Intuition} If an $\alpha$-random walk starting from source node $u$ has high probability of ending at node $v$, then node $v$ likely exerts a large influence on the representation of node $u$.

% GRACE uses intra-view nodes as hard negative instances, which is not reasonable.

% \cite{xu2018representation} defined the influence distribution of the learned node representation for input features of node as $I$ and proved that with a k-layer GCN, the influence distribution of any node is equivalent to the k-step random walk distribution of the node. \cite{gasteiger2018combining} showed a connection between the influence distribution with infinite step random walk and personalized pagerank, a type of generalized diffusion. Entries of diffusion matrix can be used to measure influence of one node on another. Since diffusion is symmetric, the influence score is mutual between any pair of nodes. Influence is good way 

% showed that, with a k-layer GNN, the influence distribution of any node is equivalent to the k-step random walk distribution of the node. But the limited range of GNN constrains the influence of nodes further away. Ideally, the influence should scale to the entire graph. Mutual influence should be proportional to representation similarity, a similar representation helps downstream classifier. 

% Influence of a node spreads to another node in a random walk manner. 
% \cite{gasteiger2018combining} showed that the limited range of GNNs can be mitigated with the use of PPR, a type of generalized graph diffusion. \cite{gasteiger2018combining} made a connection between taking the influence distribution to the infinity and PPR, and scaled the neighborhood size of each node to essentially the entire graph, and demonstrated that it improves model performance. We take an alternative view, instead of modifying the model architecture, we integrate diffusion into the learning target to mitigate the limited neighborhood problem. Influence 

% influence distribution to infinity steps and PPR approximately 
% \fi

\subsection{Memory Analysis}
\label{sec:mem}

Because it does not rely on multiple graph views or extra MLP layers, our approach has a clear advantage in memory efficiency over prior approaches. Table~\ref{ram} presents the memory complexity and the empirical GPU memory usage of the most competitive graph SSL methods on the ogbn-arxiv and ogbn-proteins datasets~\citep{hu2020open}. Each forward pass or back-propagation consumes $O(n+m)$ memory, where $n$ is the number of nodes and $m$ is the number of edges in the graph. We let $C^\text{fw}$ be the constant factor for each forward pass and $C^\text{bw}$ be the constant factor for each back-propagation. We consider the most memory-efficient graph SSL methods, GRACE~\citep{grace}, BGRL~\citep{thakoor2021large}, LaGraph~\citep{lagraph}, and CCA-SSG~\citep{ccassg}, all of which compute two or more graph views at each training step. GRACE uses intra-view and inter-view negative examples, and hence computing its loss function consumes $O(n^2)$ memory. BGRL avoids this quadratic blowup by not using negative instances in its loss function, but it computes four graph views in total: two by the online encoder and two by the target encoder. LaGraph and CCA-SSG both compute two graph views and maximize the invariance between them, and LaGraph uses an additional MLP component as the decoder. Our method computes a single graph view, so its memory efficiency is on par with supervised training both theoretically and empirically.

\input{Sample UAI 2023 paper/tables/acc_small.tex}

We employ the same encoder for all approaches to ensure a fair comparison of memory usage. On ogbn-arxiv, the memory usage of our method is on par with supervised training, while the other methods consume $2\times$ or more memory. On ogbn-proteins, supervised training alone consumes more than half the memory of a 32\,GB GPU, which makes multi-view training impractical. Consequently, among the SSL methods compared, only ours is able to train on ogbn-proteins without running out of memory.

% \iffalse
% \begin{table}[t]
% \begin{tabular}{llllll}
% Method            & GRACE & GraphMAE & LaGraph & BGRL  & \method & Supervised GCN 17.2G \\
% Memory Complexity & $2c_\text{forward}^\text{GNN}(n+m) + c_\text{backward}^\text{GNN}(n+m) + 2c_\text{forward}^\text{MLP}(n+m) + c_\text{backward}^\text{MLP}(n+m) + c_\text{GRACE}n^2$      &          & $2c_\text{forward}^\text{GNN}(n+m) + c_\text{backward}^\text{GNN}(n+m) + c_\text{forward}^\text{decoder}(n+m) + c_\text{backward}^\text{decoder}(n+m) + c_\text{LaGraph}n$        & $4c_\text{forward}^\text{GNN}(n+m) + c_\text{backward}^\text{GNN}(n+m) + 2c_\text{forward}^\text{predictor}(n+m) + c_\text{backward}^\text{predictor}(n+m) + c_\text{BGRL}n$      & $c_\text{forward}^\text{GNN}(n+m) + c_\text{backward}^\text{GNN}(n+m) + c_\text{\method}bn$        \\
% ogbn-arxiv        & OOM   & 13677    &         & 25443 &         \\
% ogbn-proteins     & OOM   & OOM      & OOM     & OOM   &        
% \end{tabular}
% \end{table}
% \fi

\subsection{Scaling to Large Graphs}
\label{sec:cluterGCN}
Real-world graphs are often very large, which poses a significant scalability challenge~\citep{tang2023autodifferentiation}. Existing methods resort to sub-sampling techniques such as neighbor sampling~\citep{thakoor2021large}. For \method, neighbor sampling may not work because node proximity measures may not be well-defined on the sampled neighborhood. To scale to very large graphs, we propose a natural extension to our method that leverages recent advances in efficient GNN training. Cluster-GCN~\citep{clustergcn} trains on subgraphs formed from clusters partitioned from the original graph to avoid the exponential neighborhood expansion problem. We leverage Cluster-GCN to scale \method to very large graphs by partitioning them into subgraphs and minimizing the divergence between the node proximity distribution and the node representation similarity distribution in each subgraph. We evaluate the effectiveness of \method scaled to large graphs in Section~\ref{large_eval}.
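
One way to formalize this extension is sketched below; it is an illustration rather than a specification of the implementation, and the symbols $V_k$, $P^{(k)}_u$, and $\mathcal L_k$ are introduced only for this purpose. Let $V_1, \dots, V_c$ be the node clusters produced by the partitioning, and let $P^{(k)}_u$ denote a normalized proximity distribution of node $u$ over the nodes of the $k$-th subgraph (how it is obtained, e.g., by computing the proximity measure on the subgraph itself, is left open in this sketch). The loss on the $k$-th subgraph then mirrors the loss of Section~\ref{sec:training} with all sums restricted to $V_k$:
\begin{multline}
\mathcal L_k = -\frac{1}{|V_k|} \sum_{u \in V_k} \sum_{v \in V_k} P^{(k)}_u(v) \\ \log \frac{\exp(\text{sim}(\mathbf z_u, \mathbf z_v))}{\sum_{i \in V_k} \exp(\text{sim}(\mathbf z_u, \mathbf z_i))}.
\end{multline}
Under this formulation, each training step needs only the $|V_k| \times |V_k|$ block of proximity scores for the current subgraph rather than scores over all $n$ nodes.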
