% \documentclass{uai2024} % for initial submission
\documentclass[accepted]{uai2024} % after acceptance, for a revised version; 
% also before submission to see how the non-anonymous paper would look like 
                        
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2024} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2024} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

\usepackage{amsmath,amssymb,amsfonts}
\usepackage{algorithmic}
\usepackage{algorithm}
\usepackage{graphicx}
\usepackage{textcomp}
\usepackage{xcolor}

\usepackage{lipsum} 
\usepackage{environ} 

% For theorems and such
\usepackage{amssymb}
\usepackage{mathtools}
\usepackage{amsthm}
\usepackage{amsmath}
\usepackage{hyperref}
\usepackage[utf8]{inputenc} % allow utf-8 input
\usepackage[T1]{fontenc}    % use 8-bit T1 fonts
\usepackage{hyperref}       % hyperlinks
\usepackage{url}            % simple URL typesetting
\usepackage{booktabs}       % professional-quality tables
\usepackage{amsfonts}       % blackboard math symbols
\usepackage{nicefrac}       % compact symbols for 1/2, etc.
\usepackage{microtype}      % microtypography      % colors \LaTeX\ Companio
\usepackage{wrapfig}
\usepackage{color}
\usepackage{adjustbox}
\usepackage{graphics}
\usepackage{enumitem}
% \usepackage[english]{babel}
\usepackage{mathabx}
\usepackage{multirow}
\usepackage[normalem]{ulem}
\useunder{\uline}{\ul}{}

\newtheorem{theorem}{Theorem}

%added colors
\definecolor{mydarkblue}{rgb}{0,0.08,0.45}
\definecolor{myblue}{HTML}{3b75c3}
\definecolor{myred}{HTML}{E33222}
\definecolor{mygreen}{HTML}{438773}
\definecolor{mymaroon}{RGB}{142,27,19}
\definecolor{maroon}{HTML}{800000}
\definecolor{mycite}{cmyk}{0.55,1,0,0.15}
\definecolor{codeblue}{rgb}{0.25,0.5,0.5}
\definecolor{codekw}{rgb}{0.85, 0.18, 0.50}
\definecolor{codegreen}{rgb}{0,0.6,0}
\definecolor{codegray}{rgb}{0.5,0.5,0.5}
\definecolor{codepurple}{rgb}{0.58,0,0.82}
\definecolor{backcolour}{rgb}{0.95,0.95,0.92}
\hypersetup{
    colorlinks=true,
    citecolor=blue,
    linkcolor=myblue,
    urlcolor=maroon
          }

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{GCVR: Reconstruction from Cross-View Enable \\ Sufficient and Robust Graph Contrastive Learning}

% The standard author block has changed for UAI 2024 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors

\author[1]{\href{mailto:<qwen@nd.edu>?Subject=Your UAI 2024 paper}{Qianlong Wen}}
\author[1]{Zhongyu Ouyang}
\author[2]{Chunhui Zhang}
\author[1]{Yiyue Qian}
\author[3]{Chuxu Zhang}
\author[1]{\href{mailto:<yye7@nd.edu>?Subject=Your UAI 2024 paper}{Yanfang Ye}}
% Add affiliations after the authors
\affil[1]{%
    % Computer Science Dept.\\
    University of Notre Dame\\
    Notre Dame, IN, USA
}
\affil[2]{%
    % Computer Science Dept.\\
    Dartmouth College\\
    Hanover, NH, USA
}
\affil[3]{%
    % Computer Science Dept.\\
    Brandeis University\\
     Waltham, MA, USA
  }
  
  \begin{document}
\maketitle

%%\vspace{-1.5in}

\begin{abstract}
Among the existing self-supervised learning (SSL) methods for graphs, graph contrastive learning (GCL) frameworks usually automatically generate supervision by transforming the same graph into different views through graph augmentation operations. 
The computation-efficient augmentation techniques enable the prevalent usage of GCL to alleviate the supervision shortage issue.  
Despite the remarkable performance of those GCL methods, the InfoMax principle used to guide the optimization of GCL has been proven to be insufficient to avoid redundant information without losing important features.  
In light of this, we introduce \textbf{G}raph Contrastive Learning with \textbf{C}ross-\textbf{V}iew \textbf{R}econstruction (GCVR), aiming to learn robust and sufficient representations from graph data. 
Specifically, GCVR introduces a cross-view reconstruction mechanism based on conventional graph contrastive learning to elicit those essential features from raw graphs. 
Besides, we introduce an extra adversarial view perturbed from the original view in the contrastive loss to pursue the intactness of the graph semantics and strengthen the representation robustness. 
We empirically demonstrate that our proposed model outperforms the state-of-the-art baselines on graph classification tasks over multiple benchmark datasets.
\end{abstract}

\section{Introduction}
\label{sec: intro}
As a kind of ubiquitous data form, graph-structured data is known for modeling complex interaction systems in the real world.
Among the existing techniques proposed to model the patterns behind graph-structured data, Graph Neural Networks (GNNs)~\citep{GCN, GAT, hamilton_inductive_2017, GIN} have achieved remarkable performance and thereby been employed in many applications, like user preference prediction, recommendation systems, and molecule property prediction \citep{dataform1, OGB, MOF-DDI, ouyang2024improve}. 
% However, they are usually limited by the training paradigm of supervised learning where large amounts of labels are in need, which is very costly or even impossible in practice. 
Despite their success, GNNs are often constrained by the supervised learning paradigm, which necessitates a substantial volume of labeled data—a requirement that is frequently costly or impractical.
Therefore, extracting the rich underlying knowledge from the unlabeled graphs has attracted increasing attention and stimulated a series of research about self-supervised learning (SSL) on graphs \citep{GCC, MVGRL, InfoGraph, MVSE}, where only minimal or no labels are required. 
Existing graph SSL strategies design different pre-training tasks to fully exploit the information in unlabeled graphs. Among them, graph contrastive learning (GCL) follows the mutual information maximization principle (InfoMax) \citep{DGI}, maximizing the agreement of positive pairs while minimizing that of negative pairs in the embedding space. 
However, the GCL paradigm has been empirically and theoretically shown to be insufficient for learning robust and transferable representations~\citep{AD-GCL, zhao_learning_2021}. 
State-of-the-art GCL methods~\citep{GCC, MVGRL, GraphCL} usually rely on applying specific graph augmentation operations (e.g., Edge Removing and Subgraph Sampling) on anchor graph $G$ to generate a positive pair of samples. 
Then, the graph feature encoder $f$ will be trained to ensure representation consistency within the positive pair.
Thus, the choice of augmentation operators and their strength can yield significant impacts on the final optimization result. 
Moderate graph augmentation will push encoders to capture redundant and biased information \citep{on_mutual_info}, adversely impacting the transferability of representations through ``shortcut'' solutions~\citep{shortcut, automatic_shortcut, CL_shortcut}.  
This is visually depicted in Figure \ref{fig: mutual_info}(a), where the shared part of the two augmentation views includes both predictive information (the overlapping area with $y$) and non-predictive information (shadow area). 
This phenomenon is also empirically proved by many previous GCL works \citep{AD-GCL, RGCL, GASSL}, where lower contrastive loss does not necessarily lead to better performance and robustness, especially under the out-of-distribution (OOD) setting~\citep{ood_general}. 
Conversely, overly aggressive augmentation also proves suboptimal, as it indiscriminately disregards both predictive and trivial features in the absence of additional guiding knowledge~\citep{CL_FS}.

\begin{figure}[t]
  \centering
  %%\vspace{-0.2in}
  \includegraphics[width=1.\linewidth]{mi2.pdf}
    %%\vspace{-0.2in}
  \caption{Illustration of the relation between graph $G$, label $y$, predictive feature subset $G^{p}$, and non-predictive feature subset $G^{c}$ in terms of information entropy. Ideally, the green areas in the three figures are null.
  (a) The usual optimization result of graph contrastive learning, where the shared features of two augmentation views are extracted for the learned representation $\mathbf{z}$. Owing to the lack of supervision or domain knowledge, redundant and biased information (shadow area) is usually included in $\mathbf{z}$; (b) $G^{p}$ covers the feature subset that is sufficient for correct graph label identification ($I(y; G \mid G^{p})=0$), while the other features ($G^{c}$) are either useless or misleading; (c) $G^{p}$ and $G^{c}$ are supposed to be mutually exclusive ($I(G^{p}; G^{c})=0$).
  The union of them covers all the features of the original data.}
  %%\vspace{-0.25in}
  \label{fig: mutual_info}
\end{figure}

To address the dilemma, recent works \citep{AD-GCL, RGCL, JOAO} propose to modify the existing graph augmentation techniques in an automated manner. 
% These methods assume the most salient sub-structure (those are resistant to graph augmentation) is sufficient for downstream label identification, and thereby add trainable regularization on the graph topological structure to remove the trivial graph sub-structures.
% Despite that these methods can alleviate the aforementioned feature suppression issue to some extent, they still follow the same optimization principle and thus suffer from inherent limitations. 
These methods operate under the premise that the most salient substructures, which are resistant to graph augmentation, are adequate for downstream label identification. Consequently, they introduce trainable regularization on the graph topological structure to eliminate trivial substructures. While these approaches mitigate the feature suppression issue to a degree, they are constrained by the same optimization principles, inheriting similar limitations.
The harsh regularization on graph topology tends to narrow the focus of encoders to `shallower' features, such as graph size and central node, in the absence of external knowledge \citep{size_ood}, thereby impairing generalization in tasks requiring more comprehensive understanding.
Therefore, the GCL methods, guided by the saliency philosophy, have not yet optimally balanced representation sufficiency and robustness.
To learn better graph representations, a graph SSL method that better reconciles information redundancy and sufficiency is urgently needed. 
The optimal representation, as suggested by the information bottleneck (IB) principle \citep{IB}, should extract minimal yet sufficient information during the learning process. Empirical evidence supports the superiority of representations aligning with the IB principle, demonstrating enhanced robustness and transferability across various domains \citep{GIB}.
% Given an input graph $G$, we denote $G^{p}$ and $G^{c}$ as its predictive feature subset and the complementary non-predictive feature subset, respectively.
% According to the assumption of recent studies about rationale invariance discover~\citep{DIR, EERM}, the two features subsets would satisfy $I(y; G \mid G^{p})=0$ (sufficiency condition) and disentanglement condition (i.e., $I(G^{p}; G^{c})=0$). We illustrate the relations among the two feature subsets and $G$ in Figure \ref{fig: mutual_info} (b) and (c).
Consider an input graph $G$, with $G^{p}$ representing its predictive feature subset and $G^{c}$ the complementary non-predictive feature subset. Recent studies on rationale invariance discovery \citep{DIR, EERM} suggest that these subsets satisfy the sufficiency condition $I(y; G \mid G^{p})=0$ and the disentanglement condition $I(G^{p}; G^{c})=0$. The relationships among $G$, $G^{p}$, and $G^{c}$ are depicted in Figure \ref{fig: mutual_info} (b) and (c). 
The representation ideally adhering to the IB principle would include all the features in $G^{p}$ while minimizing the information in $G^{c}$. 
Although it is impossible to eliminate the redundant information across different downstream tasks since there will be a variance between the knowledge required for different applications, a graph representation less suppressed by $G^{c}$ is expected to generalize better on different downstream tasks. 
Furthermore, achieving this goal in the self-supervised setting is even more challenging.


To fill this gap, we propose \textbf{G}raph Contrastive Learning with \textbf{C}ross-\textbf{V}iew \textbf{R}econstruction (GCVR), to pursue the target optimization objective. GCVR is comprised of a graph encoder and two distinct decoders, each specialized in extracting information pertinent to predictive and non-predictive features, respectively. 
% To approximate the disentanglement objective, we propose the cross-view representation reconstruction scheme, including the intra-view and inter-view reconstructions, to reconstruct the originally learned representation with the two separated feature subsets.
The model endeavors to fulfill the disentanglement objective through an innovative cross-view representation reconstruction scheme. This scheme involves both intra-view and inter-view reconstructions, aiming to reconstruct the initial learned representation using the bifurcated feature subsets.
Furthermore, the encoded representation of the original view, perturbed in an adversarial fashion, serves as a third view when computing the contrastive loss, in addition to the prediction-relevant representations of the two augmentation views, to further improve the robustness of the representations and prevent them from collapsing into a partial solution.
% We provide theoretical analysis to show that GCVR is capable of approximating the IB principle and improving the representation robustness without sacrificing sufficiency.
We present a theoretical analysis illustrating GCVR's proficiency in approximating the Information Bottleneck (IB) principle, thus enhancing representation robustness without compromising on sufficiency.
% Finally, we conduct experiments to validate the effectiveness of GCVR over the commonly used graph benchmark datasets. The experimental results demonstrate that GCVR achieves significant performance gains over different datasets and settings compared with state-of-the-art baselines. 
Finally, empirical validation of GCVR's effectiveness is conducted through extensive experiments on public graph benchmark datasets. The experimental results demonstrate that GCVR achieves significant performance gains over different datasets and settings compared with state-of-the-art baselines. 
The main contributions of this work are summarized from three aspects: (i) We propose GCVR to alleviate the feature suppression issue with the cross-view reconstruction mechanism; (ii) We provide solid theoretical analysis on our model designs; (iii) Thorough experiments are conducted to demonstrate the robustness and transferability of the representations learned via GCVR. The source code of our proposed GCVR is publicly available at \href{https://github.com/HoytWen/GCVR}{https://github.com/HoytWen/GCVR}.


\section{Preliminaries}

%%%%\vspace{-0.1in}
\subsection{Graph Representation Learning}
%%%%\vspace{-0.1in}
\label{sec: GRL}
In this work, we focus on graph-level tasks. Let $\mathcal{G}=\left\{G_{i} = (V_{i}, E_{i})\right\}_{i=1}^{N}$ denote a graph dataset with $N$ graphs, where $V_{i}$ and $E_{i}$ are the node set and edge set of graph $G_{i}$, respectively. We use $x_{v} \in \mathbb{R}^{d}$ and $x_{e} \in \mathbb{R}^{d}$ to denote the attribute vectors of each node $v \in V_{i}$ and edge $e \in E_{i}$. Each graph is associated with a label, denoted as $y_{i}$. The goal of graph representation learning is to learn an encoder $f: G_{i} \rightarrow \mathbb{R}^{d}$ so that the learned representation $\mathbf{z}_{i} = f(G_{i})$ is sufficient to predict the label $y_{i}$ of the downstream task. We say that $\mathbf{z}_{i}$ is sufficient if it contains
% the same amount of information as $G_{i}$ for label identification \citep{sufficiency}, 
no less information about the label than $G_{i}$ itself \citep{sufficiency}, 
which is formulated as: 
\begin{equation}
    I(G_{i} ; y_{i} \mid \mathbf{z}_{i})=0,
    \label{eq: info_sufficiency}
\end{equation}
where $I(\cdot\,;\cdot)$ denotes the mutual information between two variables. 
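As a brief sanity check on Eq.~\eqref{eq: info_sufficiency}, the following sketch (our illustration, assuming only that $\mathbf{z}_{i}=f(G_{i})$ is a deterministic function of $G_{i}$) expands $I(G_{i}, \mathbf{z}_{i}; y_{i})$ with the chain rule of mutual information in two ways:
\begin{equation*}
\begin{aligned}
I(G_{i}, \mathbf{z}_{i} ; y_{i}) &= I(\mathbf{z}_{i} ; y_{i}) + I(G_{i} ; y_{i} \mid \mathbf{z}_{i}) \\
&= I(G_{i} ; y_{i}) + I(\mathbf{z}_{i} ; y_{i} \mid G_{i}).
\end{aligned}
\end{equation*}
Since $\mathbf{z}_{i}$ is determined by $G_{i}$, we have $I(\mathbf{z}_{i} ; y_{i} \mid G_{i})=0$, and hence $I(G_{i} ; y_{i} \mid \mathbf{z}_{i}) = I(G_{i} ; y_{i}) - I(\mathbf{z}_{i} ; y_{i})$. Under this assumption, the sufficiency condition is equivalent to $I(\mathbf{z}_{i} ; y_{i}) = I(G_{i} ; y_{i})$, i.e., the representation retains all the label information contained in the graph.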
% We demonstrate the general optimization result of classical representation learning in Figure~\ref{fig: intro}(a).

\begin{figure*}[t]
  \centering
  %%\vspace{-0.2in}
  \includegraphics[width=\linewidth]{model2.pdf}
  %%\vspace{-0.3in}
  \caption{The illustration of the proposed GCVR. (1) Graph augmentations are applied to the input graph $G$ to produce two augmented graphs, which are then fed into the shared graph encoder $f(\cdot)$ to generate two graph embeddings $\mathbf{z}_{1}$ and $\mathbf{z}_{2}$. (2) $\mathbf{z}_{1}$ and $\mathbf{z}_{2}$ are used as the inputs of the two decoders to generate two pairs of graph embeddings, where $\mathbf{z}^{p}$ captures the predictive factors and $\mathbf{z}^{c}$ keeps the complementary non-predictive features. Then we use the two pairs of representations to reconstruct $\mathbf{z}_{1}$ and $\mathbf{z}_{2}$ both within and across views. (3) An adversarial sample generated from $G$ goes through the same procedure to generate $\mathbf{z}_{adv}^{p}$. We take it as the third view besides $\mathbf{z}_{1}^{p}$ and $\mathbf{z}_{2}^{p}$ in the contrastive loss to strengthen the robustness of $\mathbf{z}^{p}$.}
    %%\vspace{-0.2in}
  \label{fig: model}
\end{figure*}

%%\vspace{-0.05in}
\subsection{Contrastive Learning}
%%%%\vspace{-0.1in}
\label{sec: CL}
Contrastive Learning (CL) is a self-supervised representation learning method that leverages instance-level identity for supervision.
% It follows the InfoMax principle to push the learned representations to agree with each other under proper transformations. 
During the training phase, each graph $G$ first goes through proper data augmentation to generate two augmentation views $t_{1}(G)$ and $t_{2}(G)$, where $t_{1}(\cdot)$ and $t_{2}(\cdot)$ are two augmentation operators. Then, the CL method encourages the encoder $f$ (a backbone network plus a projection layer) to map $t_{1}(G)$ and $t_{2}(G)$ closer in the hidden space so that the learned representations $\mathbf{z}_{1}$ and $\mathbf{z}_{2}$ maintain all the information shared by $t_{1}(G)$ and $t_{2}(G)$. The learning of the encoder is usually directed by a contrastive loss, such as NCE loss \citep{NCE}, InfoNCE loss \citep{InfoNCE}, and NT-Xent loss \citep{SimCLR}. 
In Graph Contrastive Learning (GCL), we usually adopt a GNN, such as GCN \citep{GCN} or GIN \citep{GIN}, as the backbone network, and the commonly-used graph data augmentation operators~\citep{GraphCL}, such as node dropping, edge perturbation, subgraph sampling, and attribute masking. 

All the GCL-based methods are built on the assumption that augmentations do not 
break the sufficiency requirement to make correct predictions. 
Here, we follow \citet{robust_rep} to clarify the definition of mutual redundancy. $t_{1}(G)$ is redundant to $t_{2}(G)$ with respect to $y$ iff $t_{1}(G)$ and $t_{2}(G)$ share the same predictive information. 
Mathematically, the mutual redundancy in CL exists when: 
\begin{equation}
    I(t_{1}(G) ; y \mid t_{2}(G)) = I(t_{2}(G) ; y \mid t_{1}(G)) = 0.
    \label{eq: info_redundancy}
\end{equation}
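To make the consequence of Eq.~\eqref{eq: info_redundancy} concrete, a short derivation (a sketch that uses only the chain rule of mutual information) gives
\begin{equation*}
\begin{aligned}
I(t_{1}(G), t_{2}(G) ; y) &= I(t_{1}(G) ; y) + I(t_{2}(G) ; y \mid t_{1}(G)) \\
&= I(t_{1}(G) ; y), \\
I(t_{1}(G), t_{2}(G) ; y) &= I(t_{2}(G) ; y) + I(t_{1}(G) ; y \mid t_{2}(G)) \\
&= I(t_{2}(G) ; y),
\end{aligned}
\end{equation*}
so that, under mutual redundancy, each view alone carries as much information about $y$ as the two views jointly.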
Although GCL-based methods are usually capable of extracting useful information for label identification, including non-predictive features is unavoidable under the SSL setting owing to the lack of explicit domain knowledge. There exist situations (e.g., the OOD setting) in which the latent space of the learned representation is dominated by non-predictive features in SSL~\citep{CL_FS} and is not informative enough to make correct predictions. 
Therefore, feature suppression is not just a prevalent issue in supervised learning, but also in SSL. 
% We give a more detailed discussion about the relationship between feature suppression and GCL in Appendix \ref{appendix: feature_suppression}.


 \subsection{Feature Suppression}

In this section, we will follow the previous works \citep{CL_FS, CL_shortcut} to present a more formal definition of feature suppression and clarify its relation with contrastive learning.
First of all, we assume graph data $G$ has $n$ feature sub-spaces, $G^1, \ldots, G^n$, where each $G^i \in G$ corresponds to a distinct feature of $G$. To quantify the relation between $G$ and its feature sub-spaces, we measure the conditional probability of $G$ given a specific feature sub-space $G^i$ ($i \subseteq[n]$), denoted as $p(G \mid G^i)$. We also define an injective map $g: G^i \rightarrow G$ that produces the observation $G=g(G^i)$. Because $G^i$ is not explicitly observable, we train an encoder $f: G \rightarrow \mathbb{R}^{d}$ to map the input graph $G$ into a latent space and extract the high-level information $\mathbf{z}^{i}$ corresponding to each feature sub-space $G^i$ during contrastive learning. We therefore use $p(G \mid \mathbf{z}^{i})$ as an approximation of $p(G \mid G^{i})$. Then we have,
\begin{itemize}
    \item For any feature sub-space $G^i$ and its complementary feature sub-space $G^{\bar{i}}$, $f$ suppresses feature $i \subseteq[n]$ if $p(G \mid \mathbf{z}^{i}) = p(G \mid \mathbf{z}^{\bar{i}})$.
    \item For any feature sub-space $G^i$ and its complementary feature sub-space $G^{\bar{i}}$, $f$ distinguishes feature $i \subseteq[n]$ if $p(G \mid \mathbf{z}^{i})$ and $p(G \mid \mathbf{z}^{\bar{i}})$ have disjoint support. 
\end{itemize}
To sum up, a feature is suppressed if it makes no difference to instance discrimination. Unsupervised learning is commonly believed to produce representations with a uniform distribution over the feature space due to the lack of supervision, i.e., every feature sub-space is treated equally and no feature is suppressed. However, this is not necessarily the case in contrastive learning. Taking the commonly used InfoNCE loss~\citep{InfoNCE} as an example, it can be divided into two parts, i.e., an alignment term and a uniformity term~\citep{SimCLR}, as follows:

\begin{equation}
\begin{aligned}
\tau & \mathcal{L}^{\mathrm{InfoNCE}} =\underbrace{-\frac{1}{m} \sum_{i, j} \operatorname{sim}\left(\boldsymbol{z}_i, \boldsymbol{z}_j\right)}_{\mathcal{L}_{\text {alignment }}} \\
& +\underbrace{\frac{\tau}{m} \sum_i \log \sum_{k=1}^{2 m} \mathbf{1}_{[k \neq i]} \exp \left(\operatorname{sim}\left(\boldsymbol{z}_i, \boldsymbol{z}_k\right) / \tau\right)}_{\mathcal{L}_{\text {uniform}}} .
\end{aligned}
\end{equation}
Aligning the positive pair will distinguish the shared feature sub-space $G^{i}$. Meanwhile, there also exist random negative samples that may share the same factors in $G^{i}$, so the uniform term might suppress the feature sub-space $G^{i}$. 
Therefore, for any feature $i \subseteq[n]$, the optimization process can either suppress or distinguish it, but both of them can reach lower contrastive loss. From the analysis, we can derive the conclusion mentioned in Section \ref{sec: intro} that lower contrastive loss might not yield better performance. 
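The decomposition above can be traced back to the per-sample loss. As a sketch, assuming the standard form of InfoNCE for an anchor $\boldsymbol{z}_i$ with positive $\boldsymbol{z}_j$ and temperature $\tau$ (the per-sample notation $\ell_{i}$ is introduced here only for illustration), we have
\begin{equation*}
\begin{aligned}
\ell_{i} &= -\log \frac{\exp \left(\operatorname{sim}\left(\boldsymbol{z}_i, \boldsymbol{z}_j\right) / \tau\right)}{\sum_{k=1}^{2 m} \mathbf{1}_{[k \neq i]} \exp \left(\operatorname{sim}\left(\boldsymbol{z}_i, \boldsymbol{z}_k\right) / \tau\right)} \\
&= -\frac{\operatorname{sim}\left(\boldsymbol{z}_i, \boldsymbol{z}_j\right)}{\tau} \\
&\quad + \log \sum_{k=1}^{2 m} \mathbf{1}_{[k \neq i]} \exp \left(\operatorname{sim}\left(\boldsymbol{z}_i, \boldsymbol{z}_k\right) / \tau\right).
\end{aligned}
\end{equation*}
Averaging $\ell_{i}$ over the anchors with the same $1/m$ normalization and multiplying by $\tau$ turns the first term into the alignment term and the second into the uniform term.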


\section{Proposed Model}
%%%%\vspace{-0.1in}

In this section, we introduce the details of our proposed GCVR whose framework is shown in Figure \ref{fig: model}. Corresponding theoretical analyses are provided to justify the rationality of our designs. 
Before diving into the details of GCVR, we briefly introduce the overall framework of our model. 
% The proposed GCVR model is designed in accordance with the IB principle to extract minimal yet sufficient representation through the designed cross-view reconstruction mechanism. 
Given $G$ as the input graph instance and $f(\cdot)$ as the graph encoder, we add two decoders to map the graph representation $\mathbf{z}=f(G) \in \mathbb{R}^{d}$ into two different feature spaces $(\mathbf{z}^{p}, \mathbf{z}^{c})$, 
where $\mathbf{z}^{p} \in \mathbb{R}^{d}$ is expected to be specific to the predictive information $G^{p}$, and $\mathbf{z}^{c} \in \mathbb{R}^{d}$ is optimized to elicit the complementary non-predictive factors $G^{c}$. 
Later, we reconstruct the representation $\mathbf{z}$ with the feature subsets mapped from the same and different augmentation views to reduce the overlapping between $\mathbf{z}^{p}$ and $\mathbf{z}^{c}$.
By doing so, we approximate the disentanglement objective demonstrated in Figure \ref{fig: mutual_info} and the $\mathbf{z}^{p}$ is optimized to be the invariant part across different augmentation views.
More importantly, the proposed reconstruction procedure and added adversarial views will push the learned $\mathbf{z}^{p}$ to be as comprehensive and robust as possible for the convenience of representation reconstruction. 
To sum up, instead of imposing harsh regularization on the graph structure, GCVR adopts a new optimization strategy to 
elicit the most predictive features with the reconstruction task, thereby alleviating the feature suppression issue without sacrificing information sufficiency. 
% We further add extra regularization to guarantee $\mathbf{z}^{p}$ does not collapse into shallow or partial features during the reconstruction process. 
More details of GCVR will be introduced next. 

%%\vspace{-0.05in}
\subsection{Disentanglement by Cross-View Reconstruction}
%%%%\vspace{-0.1in}
\label{sec: cross-view}

\noindent
In GCL, we usually leverage a graph encoder, such as GCN \citep{GCN} or GIN \citep{GIN}, to encode the graph data into its representation.
In this work, we adopt GIN as the backbone network $f$ for simplicity. Note that any other commonly used graph encoder can also be adapted to our model. Given two augmentation views $t_{1}(G)$ and $t_{2}(G)$ (where $t_{1}(\cdot)$ and $t_{2}(\cdot)$ are IID sampled from the same family of augmentations $\mathcal{T}$), we first use the encoder $f(\cdot)$ to map them into a lower-dimensional hidden space to obtain the two embeddings $\mathbf{z}_{1}$ and $\mathbf{z}_{2}$. Instead of directly maximizing the agreement between the two representations $\mathbf{z}_{1}$ and $\mathbf{z}_{2}$, we further feed each of them into a pair of decoders $(g_{p}, g_{c})$ (both are MLP-based networks or GNNs sharing the same architecture) and optimize the two decoders to map each representation into the two disentangled feature sub-spaces: 

\begin{equation}
\left[\mathbf{z}^{p} = g_{p}(f(t(G))) \text{,} \; \;
\mathbf{z}^{c} = g_{c}(f(t(G))) \right],
\end{equation}
where a pair of embeddings for both $t_{1}(G)$ and $t_{2}(G)$ are generated.
Ideally, $\mathbf{z}_{1}^{p}$ and $\mathbf{z}_{2}^{p}$ satisfy the mutual redundancy assumption stated in Section~\ref{sec: CL} because $t_{1}(G)$ and $t_{2}(G)$ are augmented from the same original graph and thus naturally share the same predictive factors.

Here, we clarify the lower bound of the mutual information between the augmented view $t_{1}(G)$ and the learned representations of the other view in Theorem \ref{theorem: objective}; the same conclusion holds for the other augmented view $t_{2}(G)$. 

\begin{theorem}
\label{theorem: objective}
Suppose $f(\cdot)$ is a GNN encoder as powerful as the 1-WL test. Let $\mathbf{z}_{1}^{p}$ and $\mathbf{z}_{2}^{p}$ be specific to the predictive information of $G$, while $\mathbf{z}_{1}^{c}$ and $\mathbf{z}_{2}^{c}$ account for the complementary non-predictive factors of $t_{1}(G)$ and $t_{2}(G)$. Then we have:
\begin{equation}
I\left(t_{1}(G) ; \mathbf{z}_{2}^{p}, \mathbf{z}_{2}^{c}\right) \geq I\left( \mathbf{z}_{1}^{p} ; \mathbf{z}_{2}^{p}\right),
\nonumber
\end{equation}
\end{theorem}
\noindent where $G \in \mathcal{G} \text{ and } t_{1}(\cdot), t_{2}(\cdot) \in \mathcal{T}$. 
The detailed proof is provided in Appendix \ref{appendix: proof}. 
Given the lower bound, we substitute the objective by the mutual information between the two representations in the predictive view ($\mathbf{z}_{1}^{p}$ and $\mathbf{z}_{2}^{p}$) to maximize the consistency between the information of the two views. Therefore, we derive the objective function ensuring view invariance as follows:
% Therefore, we can maximize the consistency between the representations of the two views by maximizing the mutual information of between $\mathbf{z}_{1}^{p}$ and $\mathbf{z}_{2}^{p}$. Therefore, we can derive our objective to ensure the view invariance as follow: 

\begin{equation}
\mathcal{L}_{\text{pre}}= \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}_{\text{CL}}(\mathbf{z}_{1, i}^{p}, \mathbf{z}_{2, i}^{p}) ,
\label{eq: inv}
\end{equation}

% where $\mathcal{L}_{\text{CL}}(\cdot)$ denotes the contrastive objective and we adopt InfoNCE loss in this work \citep{InfoNCE}. 
where $\mathcal{L}_{\text{CL}}(\cdot)$ is the adopted InfoNCE loss~\citep{InfoNCE}.
To further pursue the feature disentanglement as illustrated in Figure~\ref{fig: mutual_info}(c), we propose the cross-view reconstruction mechanism. 
To be specific, we would like the representation pair $(\mathbf{z}^{p}, \mathbf{z}^{c})$ within and across the augmentation views to be able to recover the raw data so that the two objectives can be approached simultaneously.
Because graphs are non-Euclidean structured data, we instead try to recover $\mathbf{z}=f(t(G))$ given $(\mathbf{z}^{p}, \mathbf{z}^{c})$. 

More specifically, we first perform the reconstruction within the augmentation view, namely mapping $(\mathbf{z}_{w}^{p}, \mathbf{z}_{w}^{c})$ to $\mathbf{z}_{w}$, where $w \in \{1, 2\}$ indexes the augmentation view.  
Then, we define $(\mathbf{z}_{w^{\prime}}^{p}, \mathbf{z}_{w}^{c})$ as a cross-view representation pair and repeat the reconstruction procedure on it to predict $\mathbf{z}_{w}$, aiming to minimize the overlap between $\mathbf{z}^{p}$ and $\mathbf{z}^{c}$, where $w=1, w^{\prime}=2$ or $w=2, w^{\prime}=1$. 
Intuitively, the reconstruction process is capable of separating the information in the feature sets shared by the two augmentation views from that residing in the feature sets unique to each view. 
Since the two IID sampled augmentation operators ($t_{1}(\cdot)$ and $t_{2}(\cdot)$) are expected to preserve the predictive/rational features while varying the augmentation-related ones, we disentangle the rational features from $G$ following studies on rationale discovery~\citep{IRD} to ensure the features' robustness for downstream tasks.
Here, we formulate the reconstruction procedures as: 
\begin{equation}
\mathbf{z}_{w}^{r} = g_{r}\left( \mathbf{z}_{w}^{p} \odot \mathbf{z}_{w}^{c} \right) \text{,} \; \;
\mathbf{z}_{w}^{cr} = g_{r}\left( \mathbf{z}_{w^{\prime}}^{p} \odot \mathbf{z}_{w}^{c} \right) ,
\end{equation}
where $g_{r}$ is the parameterized reconstruction model and $\odot$ is a free-to-choose fusion operator, such as element-wise product or concatenation. The reconstruction procedures are optimized by minimizing the entropy $H\left( \mathbf{z}_{w} \mid \mathbf{z}_{w^{\prime}}^{p}, \mathbf{z}_{w}^{c} \right)$, where $w=w^{\prime}$ or $w \neq w^{\prime}$. 
Ideally, we reach the optimal sufficiency and disentanglement conditions illustrated in Figure \ref{fig: mutual_info}(b) and \ref{fig: mutual_info}(c) iff $H\left(\mathbf{z}_{w} \mid \mathbf{z}_{w^{\prime}}^{p}, \mathbf{z}_{w}^{c}\right)=-\mathbb{E}_{p\left(  \mathbf{z}_{w}, \mathbf{z}_{w^{\prime}}^{p}, \mathbf{z}_{w}^{c} \right)}\left[\log p\left(\mathbf{z}_{w} \mid \mathbf{z}_{w^{\prime}}^{p}, \mathbf{z}_{w}^{c}\right)\right]=0$, 
where $\mathbf{z}_{w}$ is exactly recovered given its complementary representation and the predictive representation of any view. Nevertheless, the conditional probability $p\left(\mathbf{z}_{w} \mid \mathbf{z}_{w^{\prime}}^{p}, \mathbf{z}_{w}^{c}\right)$ is intractable; we hence use the variational distribution approximated by $g_{r}$ instead, denoted as $q\left(\mathbf{z}_{w} \mid \mathbf{z}_{w^{\prime}}^{p}, \mathbf{z}_{w}^{c}\right)$.
We provide the upper bound of $H\left( \mathbf{z}_{w} \mid \mathbf{z}_{w^{\prime}}^{p}, \mathbf{z}_{w}^{c} \right)$ in Theorem \ref{theorem: disentangle}.

\begin{theorem}
\label{theorem: disentangle}
Assume $q$ is a Gaussian distribution and $g_{r}$ is the parameterized reconstruction model that infers $\mathbf{z}_{w}$ from $\left( \mathbf{z}_{w^{\prime}}^{p}, \mathbf{z}_{w}^{c} \right)$. Then we have: 
\begin{equation}
H\left( \mathbf{z}_{w} \mid \mathbf{z}_{w^{\prime}}^{p}, \mathbf{z}_{w}^{c} \right) \leq \left\|\mathbf{z}_{w}-  g_{r}\left(\mathbf{z}_{w^{\prime}}^{p} \odot \mathbf{z}_{w}^{c}\right) \right\|_{2}^{2}, 
% \text { where } w=w^{\prime} \text{ or } w \neq w^{\prime}.
\nonumber
\end{equation}
\end{theorem}
where $w=w^{\prime}$ or $w \neq w^{\prime}$. The detailed proof is provided in Appendix~\ref{appendix: proof}.
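As a brief intuition for the bound (a sketch only, under the additional assumption that $q$ is an isotropic Gaussian with mean $g_{r}(\cdot)$ and a fixed variance $\sigma^{2}$; $\sigma$, the dimension $d$ of $\mathbf{z}_{w}$, and the shorthand $\mathbf{a} = (\mathbf{z}_{w^{\prime}}^{p}, \mathbf{z}_{w}^{c})$ are notation introduced here for illustration), the non-negativity of the KL divergence between $p$ and $q$ implies that the cross-entropy under $q$ upper-bounds the conditional entropy:
\begin{small}
\begin{equation*}
\begin{aligned}
H\left(\mathbf{z}_{w} \mid \mathbf{a}\right) &\leq -\mathbb{E}_{p\left(\mathbf{z}_{w}, \mathbf{a}\right)}\left[\log q\left(\mathbf{z}_{w} \mid \mathbf{a}\right)\right] \\
&= \frac{1}{2\sigma^{2}} \mathbb{E}_{p\left(\mathbf{z}_{w}, \mathbf{a}\right)}\left[\left\|\mathbf{z}_{w} - g_{r}\left(\mathbf{z}_{w^{\prime}}^{p} \odot \mathbf{z}_{w}^{c}\right)\right\|_{2}^{2}\right] + \frac{d}{2} \log\left(2\pi\sigma^{2}\right) .
\end{aligned}
\end{equation*}
\end{small}
Under this assumption, minimizing the squared reconstruction error minimizes the upper bound up to a positive scale factor and an additive constant.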
Since we adopt two augmentation views, the objective function constraining representation disentanglement can be formulated as: 
\begin{small}
\begin{equation}
\mathcal{L}_{\text{recon}} = \frac{1}{2N} \sum_{i=1}^{N} \sum_{w=1}^{2} \left[\left\|\mathbf{z}_{w, i}-  \mathbf{z}_{w, i}^{r} \right\|_{2}^{2}+\left\|\mathbf{z}_{w, i}- \mathbf{z}_{w, i}^{cr} \right\|_{2}^{2}\right].
\label{eq: disentanglement}
\end{equation}
\end{small}
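To make the four reconstruction terms explicit, the summand of Equation~\ref{eq: disentanglement} for a single graph (dropping the index $i$) can be expanded, using the definitions of $\mathbf{z}_{w}^{r}$ and $\mathbf{z}_{w}^{cr}$ above, as:
\begin{small}
\begin{equation*}
\begin{aligned}
& \left\|\mathbf{z}_{1} - g_{r}\left(\mathbf{z}_{1}^{p} \odot \mathbf{z}_{1}^{c}\right)\right\|_{2}^{2} + \left\|\mathbf{z}_{1} - g_{r}\left(\mathbf{z}_{2}^{p} \odot \mathbf{z}_{1}^{c}\right)\right\|_{2}^{2} \\
+ & \left\|\mathbf{z}_{2} - g_{r}\left(\mathbf{z}_{2}^{p} \odot \mathbf{z}_{2}^{c}\right)\right\|_{2}^{2} + \left\|\mathbf{z}_{2} - g_{r}\left(\mathbf{z}_{1}^{p} \odot \mathbf{z}_{2}^{c}\right)\right\|_{2}^{2} ,
\end{aligned}
\end{equation*}
\end{small}
where the first term on each line is the within-view reconstruction error and the second term is the cross-view reconstruction error.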


%%%%\vspace{-0.15in}
\subsection{Adversarial Contrastive View}
%%%%\vspace{-0.1in}
\label{sec: adv-view}
%% TODO: 解释essential factors above
\noindent With the cross-view reconstruction mechanism above, the two learned representations are optimized toward disentanglement. However, it is still necessary to prevent the learned predictive representation from focusing on only a partial subset of the features, because we do not have access to explicit domain knowledge and such a narrow scope increases the risk of a shortcut solution. 
Therefore, we extend Equation~\ref{eq: inv} to three contrastive views by adding a global view without topological perturbation as the third view, to guarantee that the learned $\mathbf{z}^{p}$ maintains the global semantics instead of partial or even trivial features, i.e., $\mathbf{z}_{1}^{p} \sim G$ and $\mathbf{z}_{2}^{p} \sim G$. 
During the experiments, we find that an adversarial graph sample perturbed from the original graph view helps the model achieve stronger robustness.
A possible explanation is that redundant, non-predictive information still remains in the information shared by the $\mathbf{z}^{p}$'s of the two augmentation views, especially when the implemented augmentations are moderate. An adversarial view may further alleviate this redundancy.
We define the adversarial objective as follows: 
\begin{equation}
\delta^{*} =
\underset{\left\|\delta \right\|_{\infty} \leq \epsilon}{\operatorname{argmax}}
\mathcal{L}_{\text{adv}}\left(\mathbf{z}_{1}^{p}, \mathbf{z}_{2}^{p}, \mathbf{z}_{\text{adv}}+\delta \right) ,
\label{eq: exp_adversarial}
\end{equation}
where the adversarial sample $\mathbf{z}_{\text{adv}}+\delta$ and the two augmentation views, i.e., $\mathbf{z}_{1}^{p}$ and $\mathbf{z}_{2}^{p}$, are employed as positive pairs. Our crafted perturbation is inspired by recent work \citep{GASSL} that adds the perturbation $\delta$ to the output of the first hidden layer $\mathbf{h}^{(1)}$, since this has been empirically shown to generate more challenging views than perturbing the initial node features. Therefore, the adversarial contrastive objective is defined as:
\begin{equation}
\mathcal{L}_{\text{adv}} = \frac{1}{N} \sum_{i=1}^{N} \sum_{w=1}^{2}
\underset{\delta^{*}}{\max} \; 
\mathcal{L}_{\text{CL}}\left(\mathbf{z}_{w, i}^{p}, \mathbf{z}_{\text{adv}}+\delta^{*} \right), 
% \left[\mathcal{L}_{\text{CL}}\left(\mathbf{z}_{1, i}^{p}, G+\delta^{*} \right) + \mathcal{L}_{\text{CL}}\left(\mathbf{z}_{2, i}^{p}, G+\delta^{*} \right)\right],
\label{eq: adversarial}
\end{equation}
where the optimized perturbation $\delta^{*}$ is solved by projected gradient descent (PGD) \citep{PGD}.
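As an illustration, one common form of the $\ell_{\infty}$-bounded PGD ascent step is sketched below. The step size $\alpha$, the number of steps $T$, the sign operator, and the element-wise clipping $\operatorname{clip}_{[-\epsilon, \epsilon]}$ follow the standard PGD recipe and are assumptions here rather than a description of the exact update used in this work:
\begin{equation*}
\delta_{t} = \operatorname{clip}_{[-\epsilon, \epsilon]}\left(\delta_{t-1} + \alpha \cdot \operatorname{sign}\left(\nabla_{\delta} \mathcal{L}_{\text{adv}}\right)\right), \quad t = 1, \ldots, T,
\end{equation*}
with $\delta_{0}$ drawn uniformly from $[-\epsilon, \epsilon]$ and the final iterate $\delta_{T}$ taken as the approximate maximizer.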
Finally, we derive the joint objective of GCVR by combining all of the objectives above:
\begin{equation}
\min_{f, g} \mathbb{E}_{G \in \mathbf{G}} \left[\mathcal{L}_{\text{pre}} +\lambda_{r} \mathcal{L}_{\text{recon}} +\lambda_{a}
\underset{\left\|\delta \right\|_{\infty} \leq \epsilon}{\max}
\mathcal{L}_{\text{adv}} \right] ,
\end{equation}
where $\lambda_{r}$ and $\lambda_{a}$ are the coefficients to balance the magnitude of each loss term. 
Our proposed model is able to approximate the optimal representation illustrated in Figure \ref{fig: mutual_info}(c) with the joint objective. 
% The pseudo-algorithm of GCVR is provided in Appendix \ref{appendix: alg}


%%%%\vspace{-0.15in}
\section{Experiments}
%%%%\vspace{-0.1in}
In this section, we first present the empirical evaluation results of our proposed GCVR
and the state-of-the-art baselines 
on multiple public graph benchmark datasets under different settings. 
An ablation study is also included to evaluate the effectiveness of the designs in GCVR. 
We further conduct experiments to study the robustness and the representation disentanglement of the proposed GCVR. 
Further details on dataset statistics, training settings, and other empirical analyses are provided in Appendix \ref{appendix: dataste_statistics}, \ref{appendix: implement}, \ref{appendix: rec_loss} and \ref{appendix: hyper}. 


% \begingroup
\begin{table*}[t]
\centering
%%\vspace{-0.2in}
\caption{Overall comparison on multiple graph classification benchmarks under the unsupervised learning setting. Results are reported as mean±std\%; the best performance is in bold and runners-up are underlined. ``-'' indicates that the result is not reported in the original paper.}
%%\vspace{-0.15in}
\label{tab: res}
\begin{adjustbox}{width=\textwidth,center}
\begin{tabular}{cccccccccccc}
\toprule
  & \textbf{MUTAG} & \textbf{PTC-MR} & \textbf{COLLAB} & \textbf{NCI1} & \textbf{PROTEINS}  & \textbf{IMDB-B}   & \textbf{RDT-B}  & \textbf{IMDB-M} & \textbf{DD} \\
\midrule
node2vec&	72.6±10.2& 58.6±8.0  &   -&	54.9±1.6 &	57.5±3.6&	-&	-&	-&	- \\
% sub2vec&	61.1±15.8&	60.0±6.4&	-&	52.8±1.5& 53.0±5.6 & 	55.3±1.5  & 71.5±0.4  &		36.7±0.8 & -\\
graph2vec&	83.2±9.3&	60.2±6.9 &	-&	73.2±1.8& 73.3±2.1 & 	71.1±0.5  & 75.8±1.0   &	50.4±0.9  & - \\
InfoGraph&	89.0±1.1&	61.7±1.4  &	70.7±1.1&	76.2±1.1 & 74.4±0.3 & 	73.0±0.9  & 82.5±1.4   &	49.7±0.5  & 72.9±1.8 \\
VGAE&	87.7±0.7&	61.2±1.8   & -&	-& - & 	70.7±0.7  & 87.1±0.1   &	49.3±0.4  & - \\
MVGRL&	89.7±1.1&	62.5±1.7   & -&	-& - & 	74.2±0.7   & 84.5±0.6    &	51.2±0.5  & - \\
\midrule
GraphCL & 86.8±1.3 & 63.6±1.8 &	71.4±1.2 &  77.9±0.4 &	74.4±0.5 &  71.1±0.4 &	89.5±0.8 &	- &  78.6±0.4 \\
InfoGCL&	91.2±1.3& 63.5±1.5 &   80.0±1.3&	80.2±0.6&	-&	75.1±0.9&	-&	51.4±0.8& - \\
DGCL&	\underline{92.1±0.8}&	\underline{65.8±1.5}&	\textbf{81.2±0.3}&	\underline{81.9±0.2}&	\underline{76.4±0.5}&	\textbf{75.9±0.7}&	\underline{91.8±0.2}&	\underline{51.9±0.4}& -\\
AD-GCL&	89.7±1.0& - &   73.3±0.6&	69.7±0.5&	73.8±0.5&	72.3±0.6&	85.5±0.8&	49.9±0.7&	75.1±0.4 \\
RGCL&	87.7±1.0& - &   70.9±0.7&	78.1±1.1&	75.0±0.4&	71.9±0.8&	90.3±0.6&	-&	\underline{78.9±0.5}\\
GASSL&	90.9±7.9&	64.6±6.1&	78.0±2.0&	80.2±1.9& - & 	74.2±0.5 & - &		51.7±2.5 & -\\
GraphMAE&	91.2±1.3&	-&	80.3±0.5&	80.4±0.3& 75.3±0.4 & 	75.5±0.7 & 88.0±0.2 &		51.6±0.5 & -\\
\midrule
\textbf{GCVR}&	\textbf{92.3±0.7}&	\textbf{67.4±1.3}&	\underline{80.5±0.5}&	\textbf{82.0±1.0}&	\textbf{76.8±0.4}&	\underline{75.6±0.4}&	\textbf{92.4±0.9}&	\textbf{52.2±0.5}&	 \textbf{80.5±0.5} \\
\bottomrule
\end{tabular}
\end{adjustbox}
%%\vspace{-0.25in}
\end{table*}
% \endgroup

%%%%\vspace{-0.1in}
\subsection{Experimental Setups}
%%%%\vspace{-0.1in}

\noindent
\textbf{Datasets. }For the unsupervised learning setting, we evaluate our model on nine graph benchmark datasets for the task of graph-level property classification: five from the field of bioinformatics (MUTAG, PTC-MR, NCI1, DD, and PROTEINS) and four from the field of social networks (COLLAB, IMDB-B, RDT-B, and IMDB-M). 
For the transfer learning setting, we follow previous work~\citep{GraphCL, GraphLoG} and pre-train our model on the ZINC-2M dataset, which contains 2 million unlabeled molecule graphs sampled from MoleculeNet~\citep{moleculenet}, and then evaluate its performance on eight binary classification datasets from the chemistry domain, which are split according to molecular scaffold to simulate real-world out-of-distribution scenarios. 
Additionally, we use ogbg-molhiv from the Open Graph Benchmark \citep{OGB} to evaluate our model on a large-scale dataset under the semi-supervised setting. We provide more details about dataset statistics in Appendix \ref{appendix: dataste_statistics}.

\noindent
\textbf{Baselines. }
Under the unsupervised learning setting, 
we compare GCVR with eight state-of-the-art self-supervised learning methods, including GraphCL \citep{GraphCL}, InfoGraph \citep{InfoGraph}, MVGRL \citep{MVGRL}, AD-GCL \citep{AD-GCL}, GASSL \citep{GASSL}, InfoGCL \citep{InfoGCL}, RGCL~\citep{RGCL}, and DGCL \citep{DGCL}, as well as three classical unsupervised representation learning methods, including node2vec \citep{node2vec}, 
% sub2vec \citep{sub2vec}, 
graph2vec \citep{graph2vec}, and VGAE \citep{VGAE}. 
For the transfer learning setting, we employ AttrMasking~\citep{pretrain_strategies}, ContextPred~\citep{pretrain_strategies}, GraphCL~\citep{GraphCL}, GraphLoG~\citep{GraphLoG}, AD-GCL~\citep{AD-GCL}, RGCL~\citep{RGCL} and GraphMAE~\citep{GraphMAE} as baselines to evaluate the effectiveness of our proposed GCVR.
Under the semi-supervised learning setting, we compare GCVR with GraphCL, SimGRACE~\citep{simgrace}, AutoGCL~\citep{autogcl}, and DCL. 

\noindent
\textbf{Evaluation Protocol. }For the unsupervised setting, we follow the evaluation protocols of previous works \citep{InfoGraph, GraphCL} to verify the effectiveness of our model. 
A linear SVM classifier is trained on the learned representations for task-specific prediction.
We report the mean test accuracy under 10-fold cross-validation, together with the standard deviation over five random seeds, as the final performance.
For the transfer learning setting, we follow the fine-tuning procedures of previous work~\citep{GraphCL} and report the mean and standard deviation of the ROC-AUC scores over 10 repeated runs on each downstream dataset as the final performance. 
In addition, we follow the semi-supervised representation learning setting of GraphCL on the ogbg-molhiv dataset, with fine-tuning label rates of 1\%, 10\%, and 20\%. 
The final performance is reported as the mean ROC-AUC score of five repeated runs with different random seeds.
\label{sec: exp_setup}


%%%%\vspace{-0.1 in}
\subsection{Overall Performance Comparison}
%%%%\vspace{-0.1 in}
\label{sec: exp_res}
\textbf{Unsupervised learning. }The overall performance comparison is shown in Table~\ref{tab: res}, from which we make three observations: (1) Graph Contrastive Learning (GCL)-based methods consistently outperform traditional unsupervised learning techniques, underscoring the benefits of incorporating instance-level supervision.
(2) The models RGCL, AD-GCL, and GASSL exhibit advantages compared to GraphCL. This finding lends empirical support to the hypothesis that the InfoMax objective may introduce excessive redundant information, leading to issues with feature suppression.
(3) Notably, the disentanglement-based models, DGCL and our proposed GCVR, consistently surpass the other baselines, illustrating the efficacy of disentangled representations. In particular, GCVR achieves state-of-the-art results on the majority of datasets. A possible reason for the strong performance of DGCL on COLLAB and IMDB-B is its adaptable number of disentanglement heads, which enables a larger hyperparameter search space but also requires more effort to find the best configuration. Although it trails DGCL on COLLAB and IMDB-B, GCVR achieves the best performance on all the other datasets under the unsupervised learning setting. 


% \begingroup
\begin{table*}[t]
\centering
% %%\vspace{-0.2in}
\caption{Overall comparison on multiple graph classification benchmarks under the transfer learning setting. Results are reported as mean±std\%; the best performance is in bold and runners-up are underlined.}
%%\vspace{-0.15in}
\label{tab: transfer}
% \setlength{\tabcolsep}{3pt}
% \begin{adjustbox}{width=\textwidth,center}
\begin{tabular}{ccccccccccc} 
\toprule
  & \textbf{BBBP} & \textbf{Tox21} & \textbf{ToxCast} & \textbf{SIDER} & \textbf{ClinTox}  & \textbf{MUV}   & \textbf{HIV}  & \textbf{BACE} & \textbf{AVG} \\
\midrule
No Pre-Train &65.8±4.5    & 74.0±0.8 &	63.4±0.6&	57.3±1.6&	58.0±4.4&	71.8±2.5&	75.3±1.9&	70.1±5.4 &	67.0\\
AttrMasking &64.3±2.8    & \textbf{76.7±0.4} &	\textbf{64.2±0.5}&	61.0±0.7&	71.8±4.1&	74.7±1.4&	77.2±1.1&	79.3±1.6 &	71.1\\
ContextPred &68.0±2.0	&75.7±0.7	&63.9±0.6	&60.9±0.6	&65.9±3.8	&75.8±1.7	&77.3±1.0	&79.6±1.2 &	70.9\\
GraphCL &69.5±0.5   & 75.4±0.9 &	63.8±0.4&	60.8±0.7&	70.1±1.9&	74.5±1.3&	77.6±0.9&	78.2±1.2 &	70.8\\
GraphLoG & \textbf{72.5±0.8}    & 75.7±0.5 &	63.5±0.7&	61.2±1.1&	76.7±3.3&	76.0±1.1&	77.8±0.8&	\textbf{83.5±1.2} &	73.4\\
JOAO &70.2±1.0    & 75.0±0.3 &	62.9±0.5&	60.0±0.8&	81.3±2.5&	71.7±1.4&	76.7±1.2&	77.3±0.5 &	71.9\\
RGCL &71.4±0.7    & 75.2±0.3 &	63.3±0.2&	\underline{61.4±0.6}&	\underline{83.4±0.9}&	\textbf{76.7±1.0}&	\underline{77.9±0.8}&	76.0±0.8 &	73.2\\
GraphMAE &72.0±0.6    & 75.5±0.6 &	\underline{64.1±0.3}&	60.3±1.1&	82.3±1.2&	76.3±2.4&	77.2±1.0&	\underline{83.1±0.9} &	\underline{73.8}\\
\midrule
\textbf{GCVR}&	\underline{72.1±0.5}&	\underline{75.9±0.6}&	63.0±0.5&	\textbf{62.2±0.7}&	\textbf{83.6±1.5}&	\underline{76.6±0.7}&	\textbf{78.1±1.1}&	80.8±1.8 &	\textbf{74.0}\\
\bottomrule
\end{tabular}
% \end{adjustbox}
%%\vspace{-0.2in}
\end{table*}
% \endgroup

%% TODO: check result
% %%%%\vspace{-0.05 in}
\textbf{Transfer learning. } Table \ref{tab: transfer} presents the experimental results under the transfer learning setting. Here, the ``No Pre-Train'' baseline omits the self-supervised pre-training phase on the ZINC-2M dataset before the fine-tuning process. 
% Notably, several robust baselines, such as AttrMasking and ContextPred, benefit from the incorporation of domain-specific knowledge during training. Despite the absence of such specialized knowledge, our proposed model, GCVR, demonstrates commendable performance, securing the best or second-best results on six out of the eight datasets examined. Moreover, our model ranks first in terms of average performance across datasets. In the meantime, JOAO, RGCL, and GCVR, all derivatives of GraphCL, surpass GraphCL in average performance. This finding empirically shows the detrimental impact of biased information in these models and highlights the critical need for strategies to mitigate such biases.
The results in Table~\ref{tab: transfer} show that no baseline, including advanced models such as GraphLoG and GraphMAE, achieves consistently superior performance across all eight datasets. Moreover, several competitive baselines, such as AttrMasking and ContextPred, benefit from the incorporation of domain-specific knowledge during training~\citep{pretrain_strategies}, whereas the graph SSL baselines and our proposed GCVR do not have access to such specialized knowledge. Under this condition, GCVR still achieves the best performance on three of the eight datasets and the second-best performance on another three. Notably, GCVR achieves the highest average performance across the datasets. These results demonstrate the effectiveness of GCVR in the demanding transfer learning setting. Meanwhile, JOAO, RGCL, and GCVR, all derivatives of GraphCL, surpass GraphCL in average performance. This finding empirically shows the detrimental impact of biased information and highlights the need for strategies to mitigate such biases.
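As a sanity check on the AVG column, assuming it is the unweighted mean of the eight per-dataset ROC-AUC scores (the averaging rule is not stated explicitly), the GCVR entry in Table~\ref{tab: transfer} can be reproduced as:
\begin{small}
\begin{equation*}
\begin{aligned}
\text{AVG}_{\text{GCVR}} &= \tfrac{1}{8}(72.1 + 75.9 + 63.0 + 62.2 \\
&\quad + 83.6 + 76.6 + 78.1 + 80.8) \\
&= 592.3 / 8 \approx 74.0 .
\end{aligned}
\end{equation*}
\end{small}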



% %%%%\vspace{-0.1 in}
\noindent
\textbf{Semi-supervised learning. }The experimental results, illustrated in Figure~\ref{fig: sub_abl}, show that our model significantly outperforms all baselines across three label-rate fine-tuning scenarios. Notably, there is a clear correlation between increasing label rates and performance improvements, with gains of 1.3\%, 2.4\%, and 2.9\% observed at label rates of 1\%, 10\%, and 20\%, respectively. This trend may be explained by the hypothesis that higher volumes of trainable data introduce more redundant information, thereby exacerbating the feature suppression problem. The effective removal of this redundant information is crucial, as it seems to play a key role in the observed enhancements in performance.









\begin{figure}[t]
  \centering
  % %%\vspace{-0.15in}
  \includegraphics[width=1.\linewidth]{sup_abl.pdf}
    %%\vspace{-0.35in}
  \caption{(right) Performance comparison of semi-supervised learning on ogbg-molhiv. (left) Performance comparison between the GCVR and four model variants.}
  %%\vspace{-0.15in}
  \label{fig: sub_abl}
\end{figure}

% %%\vspace{-0.2 in}
\subsection{Ablation Study}
%%%%\vspace{-0.1 in}

To further assess the individual contributions of the modules in our proposed GCVR, we conduct ablation studies on four modified versions of the model: (1) \textbf{w/o Intra}, which excludes the intra-view reconstruction; (2) \textbf{w/o Inter}, which excludes the inter-view reconstruction; (3) \textbf{w/o CV Recon}, which completely excludes the cross-view reconstruction process; and (4) \textbf{w/o Adv. Training}, which omits the adversarial training component. The performance of these variants is shown in the left subplot of Figure~\ref{fig: sub_abl}. The results indicate that integrating the reconstruction from both views and the adversarial view in GCVR yields superior performance compared to the variants. The absence of the reconstruction process from either view can impede the model's ability to optimize representations in a disentangled fashion, as illustrated in Figure \ref{fig: mutual_info}(c). This omission leads to persistent feature suppression in the resulting representations. Additionally, the model variant lacking the adversarial view tends toward representation collapse and retains unnecessary redundant information, resulting in less optimal performance in downstream tasks.





% \begingroup
% \begin{table*}
% \centering
% %%%%\vspace{-0.1in}
% \caption{Overall comparison of the model variants’ performance. Results are reported as mean±std\%.}
% %%\vspace{-0.15in}
% \label{tab:abl}
% \setlength{\tabcolsep}{3pt}
% \begin{adjustbox}{width=\textwidth,center}
% \begin{tabular}{ccccccccccc} 
% \toprule
%   & \textbf{MUTAG} & \textbf{PTC-MR} & \textbf{COLLAB} & \textbf{NCI1} & \textbf{PROTEINS}  & \textbf{IMDB-B}   & \textbf{RDT-B}  & \textbf{IMDB-M} & \textbf{DD}\\
% \midrule
% % w/o Intra Recon &91.5±1.2    & 65.8±1.3 &	78.4±0.7&	79.6±0.7&	75.6±0.5&	74.2±0.8&	92.0±0.4&	51.5±0.4&	55.8±0.6&	79.3±0.7\\
% % w/o Inter Recon &91.0±0.9    & 64.7±1.4 &	78.0±0.8&	78.7±1.2&	74.9±0.7&	75.0±0.6&	91.1±0.7&	50.8±0.2&	55.6±0.4&	79.0±0.8\\
% w/o CV Recon &91.0±0.9    & 64.7±1.4 &	78.0±0.8&	78.7±1.2&	74.9±0.7&	75.0±0.6&	91.1±0.7&	51.7±0.6&	79.0±0.8\\
% w/o Adv. Training &92.1±0.6	&66.8±0.5	&76.5±0.6	&81.2±0.9	&76.0±0.3	&75.1±0.6	&92.2±1.0	&50.8±0.4	&80.1±0.6\\
% \textbf{GCVR}&	\textbf{92.3±0.7}&	\textbf{67.4±0.5}&	\textbf{80.5±0.5}&	\textbf{82.0±1.0}&	\textbf{76.8±0.4}&	\textbf{75.6±0.4}&	\textbf{92.5±0.9}&	\textbf{52.2±0.5}&	\textbf{80.5±0.5}\\
% \bottomrule
% \end{tabular}
% \end{adjustbox}
% %%\vspace{-0.25in}
% \end{table*}
% \endgroup



%%%%\vspace{-0.15in
\subsection{Robustness Analysis}
% %%%%\vspace{-0.1 in}

In this section, additional experiments are conducted on the ogbg-molhiv dataset to assess the robustness of the learned representation under aggressive augmentation and perturbation. The corresponding results are presented in Figure~\ref{fig: aug}. Our method is compared with GASSL across varying perturbation bounds and attack steps to evaluate their resilience against adversarial attacks. Given that both our model and GASSL utilize GIN as the underlying backbone network, the performance of GIN is also included as a baseline for comparison. 
Despite the notable performance decline induced by aggressive adversarial attacks, our proposed GCVR achieves performance comparable to or better than GIN across a majority of perturbation scenarios and demonstrates greater resilience than GASSL.


%  %%%%\vspace{-0.1in}
\begin{figure}[t]
  \centering
  \includegraphics[width=\linewidth]{adv.pdf}
 %%\vspace{-0.3in}
  \caption{The model performance on ogbg-molhiv under different perturbation bounds and attack steps.
  % and sensitivity analysis of the two important hyper-parameters (i.e., $\lambda_{r}$ and $\lambda_{a}$).
  }
  %%\vspace{-0.25in}
  \label{fig: aug}
\end{figure}

% \begingroup
\begin{table*}[t]
\centering
%%\vspace{-0.2in}
\caption{Performance comparison of the two learned representations. Results are reported as mean±std\%.}
\label{tab: disen}
%%\vspace{-0.15in}
\begin{adjustbox}{width=\textwidth,center}
\begin{tabular}{cccccccccccc} 
\toprule
  & \textbf{MUTAG} & \textbf{PTC-MR} & \textbf{COLLAB} & \textbf{NCI1} & \textbf{PROTEINS}  & \textbf{IMDB-B}   & \textbf{RDT-B} & \textbf{IMDB-M}  & \textbf{DD} & \textbf{ogbg-molhiv}\\
\midrule
$\mathbf{z}^{c}$ &88.1±1.2  & 58.6±2.0   & 75.1±0.7&	72.2±2.0&	73.5±0.8&	71.8±0.9&	89.4±1.0&    47.8±0.9    &	75.8±0.6&	69.70±2.8\\
$\mathbf{z}^{p}$&	\textbf{92.3±0.7}&   \textbf{67.4±1.3}   &	\textbf{80.5±0.5}&	\textbf{82.0±1.0}&	\textbf{76.8±0.4}&	\textbf{75.6±0.4}&	\textbf{92.5±0.9}&	 \textbf{52.2±0.5}   &   \textbf{80.5±0.5} &\textbf{75.36±1.4}\\
\bottomrule
\end{tabular}
%%\vspace{-0.2in}
\end{adjustbox}
\end{table*}
% \endgroup

\subsection{Disentanglement Analysis}

To investigate whether the feature suppression problem is equally serious in $\mathbf{z}^{p}$ and $\mathbf{z}^{c}$, we conduct experiments to compare the performance of the two representations on downstream tasks. The comparison results are shown in Table~\ref{tab: disen}.
A clear performance gap separates the two learned representations, indicating that they suffer from feature suppression to different degrees and that the feature subset that is more robust to augmentation is more informative and transferable than the features that are sensitive to augmentation.
To further study the influence of the disentanglement design in GCVR on the optimization process, we use the InfoNCE loss \citep{InfoNCE} to dynamically measure the representation difference between the two augmentation graph views based on the two disentangled representations.
% where blue lines indicate the InfoNCE loss between $\mathbf{z}^{p}_{1}$ and $\mathbf{z}^{p}_{2}$ and orange lines represent the InfoNCE loss between $\mathbf{z}^{c}_{1}$ and $\mathbf{z}^{c}_{2}$. 
For simplicity, we only show the first 100 pre-training epochs on PROTEINS and COLLAB in Figure \ref{fig: CL_Loss}; similar phenomena are observed on the other datasets. The loss curves in Figure \ref{fig: CL_Loss} show that the contrastive loss between the predictive representations, i.e., $\mathbf{z}^{p}$, gradually decreases, indicating that the predictive representation is optimized to capture the information shared between the two augmentation views. Meanwhile, the loss between the non-predictive representations, i.e., $\mathbf{z}^{c}$, increases noticeably, which is consistent with our expectation that the two independently sampled augmentation operators cause a distribution shift between the two augmentation views.  
Given the empirical analysis above, we believe our proposed GCVR can further alleviate the feature suppression issue with the disentanglement design. 
% More experiments and discussions about representation disentanglement and hyper-parameter sensitivity are provided in Appendix \ref{sec: exp_disentangle} and \ref{appendix: hyper}. 


\begin{figure}[t]
  \centering
  %%\vspace{-0.2in}
  \includegraphics[width=\linewidth]{CL.pdf}
 %%\vspace{-0.3in}
  \caption{InfoNCE loss of the two disentangled representations, where orange lines are the InfoNCE loss between $\mathbf{z}_{1}^{c}$ and $\mathbf{z}_{2}^{c}$, blue lines are the InfoNCE loss between $\mathbf{z}_{1}^{p}$ and $\mathbf{z}_{2}^{p}$.}
  %%\vspace{-0.25in}
  \label{fig: CL_Loss}
\end{figure}

% %%%%\vspace{-0.2 in}
\section{Related Work}
% %%%%\vspace{-0.1in}
\noindent
\textbf{Graph contrastive learning. }Contrastive learning was first proposed in the computer vision field \citep{SimCLR} and has since attracted a surge of interest in self-supervised graph representation learning. The principle behind contrastive learning is to utilize the instance-level identity as supervision and maximize the consistency between positive pairs in the hidden space through the designed contrast mode. Previous graph contrastive learning works generally rely on various graph augmentation techniques \citep{DGI, GCC, MVGRL, GraphCL, GSR} to generate positive pairs from the original data as similar samples. Recent works in this field try to improve the effectiveness of graph contrastive learning by finding a more challenging view \citep{AD-GCL, JOAO} or adding adversarial perturbation \citep{GASSL}. However, most of the existing methods suffer from the feature suppression issue in contrastive learning \citep{CL_FS, CL_shortcut, zhang2023sparsity}, where the predictive features and trivial ones are equally likely to be omitted during the training phase. 
Our model mitigates this issue through designs that discern the essential features from the trivial and easily disturbed ones.

\noindent
\textbf{Disentangled representation learning. }Disentangled representation learning originated in the computer vision field \citep{hsieh_learning_2018, zhao_learning_2021} and aims to disentangle the heterogeneous latent factors of the representations, thereby making the representations more robust and interpretable \citep{RLReview}.
This idea has now been widely adopted in graph representation learning.
Some works \citep{IPGDN, DisenGCN} utilize a neighborhood routing mechanism to identify the latent factors in the node representations. 
Some other generative models \citep{VGAE, GraphVAE} utilize Variational Autoencoders to balance reconstruction and disentanglement. The study of disentangled representations has also extended to self-supervised graph learning \citep{DGCL} by contrasting the factorized representations. 
Recent works \citep{DDHGNN} further demonstrate the impressive robustness and explainability of disentangled representations in dynamic graphs. 
Despite the significant benefit obtained from representation disentanglement, the underlying excessive information could still overload the model, thus limiting its capacity. Our model targets this issue by removing the redundant information that is considered irrelevant to the graph property. 

\noindent
\textbf{Graph information bottleneck. }The information bottleneck (IB) \citep{IB} has been widely adopted as a critical principle of representation learning. A representation containing minimal yet sufficient information is considered to comply with the IB principle, and many works \citep{VIB, blackbox, robust_rep} have empirically and theoretically shown that a representation that agrees with the IB principle is both informative and robust. Recently, the IB principle has also been borrowed to guide the representation learning of graph-structured data. Current methods \citep{GIB, InfoGCL, AD-GCL, RGCL} usually design different regularizations to learn compressed yet informative representations following the IB principle. We follow the information bottleneck principle to learn expressive and robust representations in this work. 

%%\vspace{-0.05in}
\section{Conclusion}
%%\vspace{-0.05in}
\label{sec: conclusion}
In this paper, we study 
% to alleviate 
the feature suppression problem in self-supervised graph representation learning. To avoid the predictive features being suppressed in learned representation, we propose a novel model, namely GCVR, which is designed following the information bottleneck principle. The cross-view reconstruction in GCVR can disentangle those more robust and transferable features from those trivial ones.
Meanwhile, we also add an adversarial view as the third view of contrastive learning to guarantee the global semantics and further enhance representation robustness. In addition, we theoretically analyze the working mechanism of our design and derive the objective based on the analysis. Extensive experiments on multiple graph benchmark datasets and different settings demonstrate the ability of GCVR to learn robust and transferable graph representations. In future work, we plan to explore a practical objective that further decreases the upper bound of the mutual information between the disentangled representations, and to adopt more efficient training strategies that reduce the training time of the proposed model on large-scale graphs. 

\section*{Acknowledgements}
This work was partially supported by the NSF under grants IIS-2321504, IIS-2334193, IIS-2340346, IIS-2203262, IIS-2217239, CNS-2203261, and CMMI-2146076. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the sponsors. 


% References
\bibliography{uai2024-template}

\newpage

\onecolumn

\title{Graph Contrastive Learning with Cross-View Reconstruction\\(Supplementary Material)}
\maketitle

\appendix

\section{Training Algorithm}
\label{appendix: alg}
In this section, we summarize the details of our proposed method in Algorithm~\ref{alg:HDGAT}.
\begin{algorithm}
\caption{The training algorithm of \textbf{GCVR}}\label{alg:HDGAT}
\begin{algorithmic}
    % \SetKwInOut{Input}{Input}
    % \SetKwInOut{Output}{Output}
    \STATE {\bfseries Input:} Graph dataset $\mathcal{G}=\left\{G_{i} = (V_{i}, E_{i})\right\}_{i=1}^{N}$; augmentation family $\mathcal{T}$; loss coefficients $\lambda_{r}$, $\lambda_{a}$; number of ascent steps $T$; ascent step size $\alpha$; perturbation bound $\epsilon$. 
    \STATE {\bfseries Output:} The disentangled predictive representations $\mathbf{Z}^{p}=\left\{ \mathbf{z}_{i}^{p} \right\}_{i=1}^{N}$
    \FOR{each training epoch}
    \FOR{sampled minibatch $\mathcal{B}=\left\{G_{i}\right\}_{i=1}^{|\mathcal{B}|}$}
    \FOR{$G_{i} \in \mathcal{B}$}
    \STATE $\mathbf{z}_{1, i}=f\left(t_{1}(G_{i})\right)$, $\mathbf{z}_{2, i}=f\left(t_{2}(G_{i})\right)$, where $t_{1}(\cdot), t_{2}(\cdot) \in \mathcal{T}$
    \STATE $\mathbf{z}_{1, i}^{p}=g_{p}\left( \mathbf{z}_{1, i} \right)$, $\mathbf{z}_{2, i}^{p}=g_{p}\left( \mathbf{z}_{2, i} \right)$
    \STATE $\mathbf{z}_{1, i}^{c}=g_{c}\left( \mathbf{z}_{1, i} \right)$, $\mathbf{z}_{2, i}^{c}=g_{c}\left( \mathbf{z}_{2, i} \right)$
    \ENDFOR
    \STATE Calculate $\mathcal{L}_{\text{pre}}$ according to Equation 6
    \STATE Calculate $\mathcal{L}_{\text{recon}}$ according to Equation 8
    \STATE $\mathcal{L} \leftarrow \mathcal{L}_{\text{pre}} + \lambda_{r} \mathcal{L}_{\text{recon}}$
    \STATE $\delta_{0} \leftarrow U(-\epsilon, \epsilon)$
    \FOR{$t=1$ to $T$}
    \STATE Calculate $\mathcal{L}_{\text{adv}}$ according to Equation 10
    \STATE $\delta_{t} \leftarrow \delta_{t-1} + \alpha\nabla_{\delta} \mathcal{L}_{\text{adv}}$ \COMMENT{Update perturbation to maximize $\mathcal{L}_{\text{adv}}$}
    \STATE $\mathcal{L} \leftarrow \mathcal{L} + \frac{\lambda_{a}}{T} \mathcal{L}_{\text{adv}}$ 
    \ENDFOR
    \STATE Update the parameters $\theta$ of $f$ and $g$ with the gradient $\nabla_{\theta} \mathcal{L}(\theta, \mathcal{B})$
    \ENDFOR
    \ENDFOR
    \STATE $\mathbf{Z}^{p}=\left\{ \mathbf{z}_{i}^{p} \right\}_{i=1}^{N}$, where $\mathbf{z}_{i}^{p} = g_{p}\left( f(G_{i}) \right)$ 
\end{algorithmic}
\end{algorithm}
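
Reading the accumulation steps of the algorithm together, the loss whose gradient drives the parameter update on a minibatch can be summarized in a single expression. This is only a sketch: $\mathcal{L}_{\text{adv}}^{(t)}$ is shorthand introduced here for the adversarial loss evaluated at the $t$-th ascent step, i.e., with perturbation $\delta_{t-1}$.
\begin{equation}
\mathcal{L} = \mathcal{L}_{\text{pre}} + \lambda_{r} \mathcal{L}_{\text{recon}} + \frac{\lambda_{a}}{T} \sum_{t=1}^{T} \mathcal{L}_{\text{adv}}^{(t)} .
\nonumber
\end{equation}
In this form, $\lambda_{a}$ weights the average adversarial loss over the $T$ ascent steps, so its scale does not change with $T$.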

\vspace{-0.2in}


\begin{figure*}[t]
    \centering
    \vspace{-0.1in}
    \includegraphics[width=1\linewidth]{ood_case.pdf}
    \vspace{-0.2in}
    \caption{An out-of-distribution situation in a molecule graph prediction task. The causal functional sub-structure (red) is spuriously correlated with different trivial sub-structures in the training and test sets. Such statistical correlations can lead to poor robustness and transferability.}
    % %%\vspace{-0.1in}
    \label{fig: ood}
\end{figure*}

\section{Out-of-distribution Scenario on Graph}
\label{appendix: ood_case}

In this section, we illustrate the out-of-distribution scenario in graph learning tasks. In molecular property studies, a specific property (e.g., toxicity or lipophilicity) of a molecule usually depends on whether it contains the corresponding sub-structures (termed functional groups). For example, hydrophilic molecules usually have the hydroxyl group ($-OH$).
Therefore, a GNN model well trained on molecule graph prediction tasks is capable of reflecting the sub-structure information in the graph representation. However, in real-world scenarios the predictive functional group is often accompanied by irrelevant groups in some environments, which causes spurious correlations. Such correlations usually lead to poor generalization when the model is evaluated in another environment with different spurious correlations. Figure \ref{fig: ood} illustrates this scenario, where the red subgraph is the feature we can rely on to make a causal prediction. In the training set, however, it usually appears together with a green subgraph that is not the functional group of the property. Consequently, the model is easily misled into treating the green subgraph as an important indicator of the property. When we evaluate the model on the test set, where the causal subgraph is correlated with another kind of group (yellow subgraph), there is usually a large gap between its performance on the two sets. 




% \section{Feature Suppression}
% \label{appendix: feature_suppression}

% In this section, we will follow the previous works \citep{CL_FS, CL_shortcut} to present a more formal definition of feature suppression and clarify its relation with contrastive learning.
% First of all, we assume graph data $G$ has $n$ feature sub-spaces, $G^1, \ldots, G^n$, where each $G^i \in G$ corresponds to a distinct feature of $G$. To quantify the relation between $G$ and its feature sub-spaces, we need to measure the conditional probability of $G$ given a specific kind of feature sub-space $G^i$ ($i \subseteq[n]$), denoted as $p(G \mid G^i)$. Finally, we define an injective map $g: G^i \rightarrow G$ to produce observation $G=g(G^i)$. Due to the reason that $G^i$ is not explicit, so we aim to train an encoder $f: G_{i} \rightarrow \mathbb{R}^{d}$ to map input graph data $G$ into a latent space to extract useful high-level information $\mathbf{z}^{i}$ corresponding to each feature sub-space $G^i$ of input data $G$ during contrastive learning. Therefore, we use $p(G \mid \mathbf{z}^{i})$ as the approximated value of the measurement $p(G \mid G^{i})$. Then we have,
% \begin{itemize}
%     \item For any feature sub-space $G^i$ and its complementary feature sub-subspace $G^{\bar{i}}$, $f$ suppress feature $i \subseteq[n]$ if we have $p(G \mid \mathbf{z}^{i}) = p(G \mid \mathbf{z}^{\bar{i}})$
%     \item For any feature sub-space $G^i$ and its complementary feature sub-subspace $G^{\bar{i}}$, $f$ distinguish feature $i \subseteq[n]$ if $p(G \mid \mathbf{z}^{i})$ and $p(G \mid \mathbf{z}^{\bar{i}})$ have disjoint support. 
% \end{itemize}
% To sum up, a feature is suppressed if it does not make any difference to the instance discrimination. One of the common acknowledgments for unsupervised learning strategy is that it can usually produce representation with uniform feature space distribution due to the lack of supervision, i.e., every feature sub-space is equally treated without feature suppression. However, it could not be the situation in contrastive learning. Taking the commonly used InfoNCE~\citep{InfoNCE} as an example, it can be divided into two parts, i.e. align term and uniform term~\citep{SimCLR}, as follows:

% \begin{equation}
% \begin{aligned}
% \tau & \mathcal{L}^{\mathrm{InfoNCE}} =\underbrace{-\frac{1}{m} \sum_{i, j} \operatorname{sim}\left(\boldsymbol{z}_i, \boldsymbol{z}_j\right)}_{\mathcal{L}_{\text {alignment }}} +\underbrace{\frac{\tau}{m} \sum_i \log \sum_{k=1}^{2 m} \mathbf{1}_{[k \neq i]} \exp \left(\operatorname{sim}\left(\boldsymbol{z}_i, \boldsymbol{z}_k\right) / \tau\right)}_{\mathcal{L}_{\text {uniform}}} .
% \end{aligned}
% \end{equation}
% Aligning the positive pair will distinguish the shared feature subspace $G^{i}$. Meanwhile, there also exits random negative samples that might own the same factors in $G^{i}$, so the uniform term might suppress the feature sub-space $G^{i}$. 
% Therefore, for any feature $i \subseteq[n]$, the optimization process can either suppress or distinguish it, but both of them can reach lower contrastive loss. From the analysis, we can derive the conclusion mentioned in Section \ref{sec: intro} that lower contrastive loss might not yield better performance. 

\section{Summary of Datasets}
\label{appendix: dataste_statistics}
In this work, we use nine datasets from the TU Benchmark Datasets \citep{TUDataset} to evaluate our proposed GCVR under the unsupervised setting, where five of them are biochemical datasets and the other four are social network datasets.
We also use the ogbg-molhiv dataset from the Open Graph Benchmark (OGB) \citep{OGB} to further evaluate GCVR under the semi-supervised setting. Besides, the datasets sampled from MoleculeNet~\citep{moleculenet} are employed to evaluate our model under the transfer learning setting. The statistics of these datasets are shown in Tables \ref{tab: tu_data_stat} and \ref{tab: transfer_data_stat}. 

\begin{table*}
\centering
\caption{Statistics of TU-datasets and OGB dataset. }
\vspace{-0.15in}
\begin{tabular}{ccccccc} 
\toprule
  \textbf{Dataset} & \textbf{\#Graphs} & \textbf{Avg \#Nodes} & \textbf{Avg \#Edges} & \textbf{\#Class} & \textbf{Metric}  & \textbf{Category} \\
\midrule
MUTAG &188    & 17.93 &	19.79&	2&	Accuracy&	biochemical\\
PTC-MR &344  &14.29  &  14.69&  2&	 Accuracy&	biochemical\\
PROTEINS &1,113	&39.06	&72.82	&2	&Accuracy	&biochemical\\
NCI1 &4,110	&29.87	&32.30	&2	&Accuracy	&biochemical\\
DD &1,178	&284.32	&715.66	&2	&Accuracy	&biochemical\\
COLLAB &5,000	&74.49	&2457.78	&3	&Accuracy	&social network\\
IMDB-B &1,000	&19.77	&96.53	&2	&Accuracy	&social network\\
RDT-B &2,000	&429.63	&497.75	&2	&Accuracy	&social network\\
IMDB-M &1,500	&13.00	&65.94	&3	&Accuracy	&social network\\
ogbg-molhiv &41,127	&25.50 		&27.50	&2	&ROC-AUC	&MoleculeNet\\
\bottomrule
\end{tabular}
\label{tab: tu_data_stat}
\end{table*}

\begin{table*}[h]
\centering
\caption{Statistics of MoleculeNet datasets. }
\vspace{-0.15in}
\begin{tabular}{ccccccc} 
\toprule
  \textbf{Dataset} & \textbf{\#Graphs} & \textbf{Avg \#Nodes} & \textbf{Avg Degree} & \textbf{\#Tasks} & \textbf{Metric}  & \textbf{Category} \\
\midrule
ZINC-2M &2,000,000    & 26.62 &	57.72&	-&	-&	biochemical\\
BBBP &2,039  &24.06  &  51.90&  1&	 ROC-AUC&	biochemical\\
Tox21 &7,813	&18.57	&38.58	&12	&ROC-AUC	&biochemical\\
ToxCast &8,576	&18.78	&38.62	&617	&ROC-AUC	&biochemical\\
SIDER &1,427	&33.64	&70.71	&27	&ROC-AUC	&biochemical\\
ClinTox &1,477	&26.15	&55.76	&2	&ROC-AUC	&biochemical\\
MUV &93,087	&24.23	&52.55	&17	&ROC-AUC	&biochemical\\
HIV &41,127	&25.51	&54.93	&1	&ROC-AUC	&biochemical\\
BACE &1,513	&34.08	&73.71	&1	&ROC-AUC	&biochemical\\
\bottomrule
\end{tabular}
\label{tab: transfer_data_stat}
\end{table*}

All eleven datasets are publicly available; their links are as follows: 
\begin{itemize}
    \item TU datasets: \url{https://chrsmrrs.github.io/datasets/docs/datasets/}
    \item MoleculeNet datasets: \url{http://snap.stanford.edu/gnn-pretrain/}
    \item ogbg-molhiv dataset: \url{https://ogb.stanford.edu/docs/graphprop/#ogbg-mol} 
\end{itemize}


\section{Implementation Details}
\label{appendix: implement}
All experiments are conducted with the following settings:
\begin{itemize}
    \item Operating System: Ubuntu 18.04.5 LTS
    \item CPU: AMD(R) Ryzen 9 3900X
    \item GPU: NVIDIA GeForce RTX 2080 Ti
    \item Software: Python 3.8.5; PyTorch 1.10.1; PyTorch Geometric 2.0.4; PyGCL 0.1.2; NumPy 1.20.1; scikit-learn 0.24.1. 
\end{itemize}
We implement our framework with PyTorch and the PyGCL library \citep{PyGCL}. 
We choose GIN \citep{GIN} as the backbone graph encoder and optimize the model with the Adam optimizer. Following prior work \citep{GraphCL, GASSL, DGCL}, we employ a linear SVM classifier for downstream task-specific classification. The graph augmentation operations used in our work are the same as those of \citet{GraphCL}, including node dropping, edge perturbation, attribute masking, and subgraph sampling; all of them are taken from the implementation of PyGCL \citep{PyGCL}. There are two specific hyper-parameters in our model, namely $\lambda_{r}$ and $\lambda_{a}$, whose search spaces are $\left \{0.0, 1.0, 3.0, 5.0, 10.0 \right \}$ and $\left \{0.0, 0.25, 0.5, 0.75, 1.0 \right \}$, respectively. For other important hyper-parameters, we search the learning rate over $\left \{0.01, 0.005, 0.001, 0.0005, 0.0001 \right \}$, the embedding dimension over $\left \{32, 64, 128, 256, 512 \right \}$, the number of GNN layers over $\left \{2, 3, 4, 5\right \}$, and the batch size over $\left \{32, 64, 128, 256, 512 \right \}$ (for ogbg-molhiv, $\left \{64, 128, 256, 512, 1024 \right \}$). Besides, we fix the perturbation bound $\epsilon$, ascent step size $\alpha$, and number of ascent steps $T$ to 0.008, 0.008, and 5, respectively, during hyper-parameter tuning. For the implementation details of transfer learning, we follow the pre-training setting of previous work~\citep{GraphCL}. 

\vspace{-0.1in}

\section{Proof}
\label{appendix: proof}
\vspace{-0.1in}
\subsection{Proof of Theorem 1}
\label{theorem: objective_appendix}

\textbf{Theorem 1. }
\textit{Suppose $f(\cdot)$ is a GNN encoder as powerful as the 1-WL test. Let $g_{p}(\cdot)$ elicit only the augmentation information from $\mathbf{z}$, while $g_{c}(\cdot)$ extracts the essential factors of $G$ from $\mathbf{z}_{1}$ and $\mathbf{z}_{2}$. Then we have:}
\begin{equation}
I\left(t_{1}(G) ; \mathbf{z}_{2}^{c}, \mathbf{z}_{2}^{p}\right) \geq I\left( \mathbf{z}_{1}^{p} ; \mathbf{z}_{2}^{p}\right) \text { where } G \in \mathcal{G} \text{ and } t_{1}(\cdot), t_{2}(\cdot) \in \mathcal{T}.
\nonumber
\end{equation}

\textbf{Proof.} According to the assumption in Theorem \ref{theorem: objective}, for any two graphs $G, G^{\prime} \in \mathcal{G}$, if $G \cong G^{\prime}$ then we have $\mathbf{z}=\mathbf{z^{\prime}}$, where $\mathbf{z}=f(G)$ and $\mathbf{z^{\prime}}=f(G^{\prime})$. 

Besides, $\mathbf{z}^{p}=g_{p}(\mathbf{z})$ is specific to the predictive factors and $\mathbf{z}^{c}=g_{c}(\mathbf{z})$ is specific to the non-predictive factors, which means $\mathbf{z}^{p}$ and $\mathbf{z}^{c}$ are mutually exclusive and $\mathbf{z}^{p} \sim G$. So we have, 
% are learned from the same augmented graph view $t(G)$ and they are disentangled (mutually excluded). thus they are independent and conditional independent:  
\begin{equation}
\begin{gathered}
    p\left(\mathbf{z}^{p}, \mathbf{z}^{c}\right) = p\left(\mathbf{z}^{p}\right)p\left(\mathbf{z}^{c}\right) \\
    p\left(\mathbf{z}^{p}, \mathbf{z}^{c} \mid t(G) \right) = p\left(\mathbf{z}^{p} \mid t(G) \right)p\left(\mathbf{z}^{c} \mid t(G) \right) .
\end{gathered}
\label{eq: independent}
\end{equation}

Next, we prove that for any three random variables $a$, $b$ and $c$ satisfying $p\left( b, c \right)=p\left( b \right)p\left( c \right)$ and $p\left( b, c \mid a \right)=p\left( b \mid a \right)p\left( c \mid a \right)$, we have $I\left( a ; b \mid c \right)=I\left(a ; b \right)$. According to the definition of conditional mutual information, we have, 
\begin{equation}
\begin{aligned}
      & I\left(a ; b \mid c\right) =  \\ 
      & \sum_{a} \sum_{b} \sum_{c} p\left(a, b, c\right) \log \frac{p\left(a, b, c\right) p\left(c\right)}{p\left(a, c\right) p\left(b, c\right)}  \\
      & = \sum_{a} \sum_{b} \sum_{c}
      p\left(a\right)p\left(b, c \mid a\right) \log \frac{p\left(b, c \mid a \right) p\left(c\right)}{p\left(c \mid a\right) p\left(b\right)p\left(c\right)} \\ 
      & = \sum_{a} \sum_{b} \sum_{c}
      p\left(a\right)p\left(b \mid a\right) p\left(c \mid a\right) \log \frac{p\left(b \mid a \right)p\left(c \mid a \right)}{p\left(c \mid a\right) p\left(b\right)} \\ & = \sum_{a} \sum_{b}
      p\left(a\right)p\left(b \mid a\right) \log \frac{p\left(b \mid a \right)}{p\left(b\right)} \\ 
      & = \sum_{a} \sum_{b}
      p\left(a, b\right) \log \frac{p\left(b \mid a \right)}{p\left(b\right)} \\ 
      & = I\left(a; b\right) .
\end{aligned}
\label{eq: disen_propo}
\end{equation}
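
As a sanity check, the same identity can be obtained more compactly from entropies. This alternative route is given only for illustration and relies on the same two independence assumptions:
\begin{equation}
\begin{aligned}
    I\left(a ; b \mid c\right) & = H\left(b \mid c\right) - H\left(b \mid a, c\right) \\
    & = H\left(b\right) - H\left(b \mid a\right) \\
    & = I\left(a ; b\right) ,
\end{aligned}
\nonumber
\end{equation}
where $H\left(b \mid c\right)=H\left(b\right)$ follows from $p\left(b, c\right)=p\left(b\right)p\left(c\right)$, and $H\left(b \mid a, c\right)=H\left(b \mid a\right)$ follows from $p\left(b, c \mid a\right)=p\left(b \mid a\right)p\left(c \mid a\right)$.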

After that, by applying the chain rule to $I\left(t_{1}(G) ; \mathbf{z}_{2}^{p}, \mathbf{z}_{2}^{c}\right)$, we have,
\begin{equation}
\begin{aligned}
    I\left(t_{1}(G) ; \mathbf{z}_{2}^{p}, \mathbf{z}_{2}^{c}\right) 
    & = I\left(t_{1}(G) ; \mathbf{z}_{2}^{p} \mid \mathbf{z}_{2}^{c}\right) + I\left(t_{1}(G) ; \mathbf{z}_{2}^{c}\right) \\
    & \stackrel{(2)}{=} I\left(t_{1}(G) ; \mathbf{z}_{2}^{p} \right) + I\left(t_{1}(G) ; \mathbf{z}_{2}^{c}\right) \\ 
    & \stackrel{(a)}{\geq} I\left(t_{1}(G) ; \mathbf{z}_{2}^{p} \right) \\ 
    & \stackrel{(b)}{\geq} I\left(\mathbf{z}_{1}^{c}, \mathbf{z}_{1}^{p} ; \mathbf{z}_{2}^{p} \right) \\
    & \stackrel{(2)}{=} I\left(\mathbf{z}_{1}^{c}; \mathbf{z}_{2}^{p} \right) + I\left(\mathbf{z}_{1}^{p} ; \mathbf{z}_{2}^{p} \right) \\ 
    & \stackrel{(a)}{\geq} I\left(\mathbf{z}_{1}^{p} ; \mathbf{z}_{2}^{p} \right) ,
\end{aligned}
\end{equation}
where $\stackrel{(2)}{=}$ follows from the conclusion of Equation \ref{eq: disen_propo}, $\stackrel{(a)}{\geq}$ is based on the non-negativity of mutual information, i.e., $I(\cdot\,;\cdot) \geq 0$, and $\stackrel{(b)}{\geq}$ follows from the data processing inequality \citep{DPI}. Finally, we reach the lower bound of $I\left(t_{1}(G) ; \mathbf{z}_{2}^{p}, \mathbf{z}_{2}^{c}\right)$ in the derivation above, and thus we can maximize the consistency between the information captured from the two augmented graph views by minimizing $\mathcal{L}_{\text{pre}}$. 
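
For completeness, the second use of $\stackrel{(2)}{=}$ can be unpacked with the chain rule. The sketch below applies Equation \ref{eq: disen_propo} with $a=\mathbf{z}_{2}^{p}$, $b=\mathbf{z}_{1}^{p}$ and $c=\mathbf{z}_{1}^{c}$, and it assumes that the conditional independence of $\mathbf{z}_{1}^{p}$ and $\mathbf{z}_{1}^{c}$ also holds given $\mathbf{z}_{2}^{p}$, which is an additional assumption made here for illustration:
\begin{equation}
\begin{aligned}
    I\left(\mathbf{z}_{1}^{c}, \mathbf{z}_{1}^{p} ; \mathbf{z}_{2}^{p} \right)
    & = I\left(\mathbf{z}_{1}^{c} ; \mathbf{z}_{2}^{p} \right) + I\left(\mathbf{z}_{1}^{p} ; \mathbf{z}_{2}^{p} \mid \mathbf{z}_{1}^{c} \right) \\
    & = I\left(\mathbf{z}_{1}^{c} ; \mathbf{z}_{2}^{p} \right) + I\left(\mathbf{z}_{1}^{p} ; \mathbf{z}_{2}^{p} \right) .
\end{aligned}
\nonumber
\end{equation}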

\subsection{Proof of Theorem 2}
\label{theorem: disentangle_appendix}

\textbf{Theorem 2. }
\textit{Assume $q$ is a Gaussian distribution and $g_{r}$ is the parameterized reconstruction model which infers $\mathbf{z}_{w}$ from $\left( \mathbf{z}_{w^{\prime}}^{p}, \mathbf{z}_{w}^{c} \right)$. Then we have: }
\begin{equation}
H\left( \mathbf{z}_{w} \mid \mathbf{z}_{w^{\prime}}^{p}, \mathbf{z}_{w}^{c} \right) \leq \left\|\mathbf{z}_{w}-  g_{r}\left(\mathbf{z}_{w^{\prime}}^{p} \odot \mathbf{z}_{w}^{c}\right) \right\|_{2}^{2} \text { where } w=w^{\prime} \text{ or } w \neq w^{\prime}.
\nonumber
\end{equation}

\textbf{Proof.} To reconstruct the entangled representation $\mathbf{z}_{w}$ from its corresponding non-predictive representation $\mathbf{z}_{w}^{c}$ and the predictive representation of any augmentation view $\mathbf{z}_{w^{\prime}}^{p}$ ($w$ and $w^{\prime}$ are not necessarily equal), we need to minimize the conditional entropy:
\begin{equation}
    H\left(\mathbf{z}_{w} \mid \mathbf{z}_{w^{\prime}}^{p}, \mathbf{z}_{w}^{c}\right)=-\mathbb{E}_{p\left(  \mathbf{z}_{w}, \mathbf{z}_{w^{\prime}}^{p}, \mathbf{z}_{w}^{c} \right)}\left[\log p\left(\mathbf{z}_{w} \mid \mathbf{z}_{w^{\prime}}^{p}, \mathbf{z}_{w}^{c}\right)\right].
\end{equation}
Since the true distribution $p\left(\mathbf{z}_{w} \mid \mathbf{z}_{w^{\prime}}^{p}, \mathbf{z}_{w}^{c}\right)$ is unknown and intractable, we introduce a variational distribution $q\left(\mathbf{z}_{w} \mid \mathbf{z}_{w^{\prime}}^{p}, \mathbf{z}_{w}^{c}\right)$ to approximate it. Therefore, we have,
\begin{equation}
\begin{aligned}
    \mathbb{E}_{p\left(  \mathbf{z}_{w}, \mathbf{z}_{w^{\prime}}^{p}, \mathbf{z}_{w}^{c} \right)} & \left[\log p\left(\mathbf{z}_{w} \mid \mathbf{z}_{w^{\prime}}^{p}, \mathbf{z}_{w}^{c} \right)\right]  = \\
    & \mathbb{E}_{p\left(  \mathbf{z}_{w}, \mathbf{z}_{w^{\prime}}^{p}, \mathbf{z}_{w}^{c} \right)}\left[\log q\left(\mathbf{z}_{w} \mid \mathbf{z}_{w^{\prime}}^{p}, \mathbf{z}_{w}^{c}\right)\right] \\
    & + D_{\mathrm{KL}}\left(p\left(\mathbf{z}_{w} \mid \mathbf{z}_{w^{\prime}}^{p}, \mathbf{z}_{w}^{c}\right) \| q\left(\mathbf{z}_{w} \mid \mathbf{z}_{w^{\prime}}^{p}, \mathbf{z}_{w}^{c}\right)\right) .
\end{aligned}
\end{equation}
Due to the non-negativity of the KL-divergence between any two distributions, $-\mathbb{E}_{p\left(  \mathbf{z}_{w}, \mathbf{z}_{w^{\prime}}^{p}, \mathbf{z}_{w}^{c} \right)}\left[\log q\left(\mathbf{z}_{w} \mid \mathbf{z}_{w^{\prime}}^{p}, \mathbf{z}_{w}^{c}\right)\right]$ is an upper bound of $H\left(\mathbf{z}_{w} \mid \mathbf{z}_{w^{\prime}}^{p}, \mathbf{z}_{w}^{c}\right)$. Based on the assumption of Theorem \ref{theorem: disentangle}, let $q\left(\mathbf{z}_{w} \mid \mathbf{z}_{w^{\prime}}^{p}, \mathbf{z}_{w}^{c}\right)$ be a Gaussian distribution $\mathcal{N}\left(\mathbf{z}_{w} \mid g_{r}\left(\mathbf{z}_{w^{\prime}}^{p} \odot \mathbf{z}_{w}^{c}\right), \sigma^2 \mathbf{I}\right)$, where $g_{r}(\cdot)$ is the reconstruction network that predicts $\mathbf{z}_{w}$ from $\left( \mathbf{z}_{w^{\prime}}^{p}, \mathbf{z}_{w}^{c} \right)$, $\sigma^{2}$ is the variance, and $d$ denotes the dimension of $\mathbf{z}_{w}$. Thus we have, 
\begin{equation}
\begin{aligned}
    H\left(\mathbf{z}_{w} \mid \mathbf{z}_{w^{\prime}}^{p}, \mathbf{z}_{w}^{c}\right) 
    & \leq -\mathbb{E}_{p\left(  \mathbf{z}_{w}, \mathbf{z}_{w^{\prime}}^{p}, \mathbf{z}_{w}^{c} \right)}\left[\log q\left(\mathbf{z}_{w} \mid \mathbf{z}_{w^{\prime}}^{p}, \mathbf{z}_{w}^{c}\right)\right] \\
    & = - \mathbb{E}_{p\left(  \mathbf{z}_{w}, \mathbf{z}_{w^{\prime}}^{p}, \mathbf{z}_{w}^{c} \right)}\left[\log \left(\frac{1}{\left(2 \pi \sigma^{2}\right)^{d/2}}  e^{ -\frac{ \left\| \mathbf{z}_{w}-g_{r}\left(\mathbf{z}_{w^{\prime}}^{p} \odot \mathbf{z}_{w}^{c}\right) \right\|_{2}^{2}}{2 \sigma^{2}}} \right) \right] \\ 
    & = - \mathbb{E}_{p\left(  \mathbf{z}_{w}, \mathbf{z}_{w^{\prime}}^{p}, \mathbf{z}_{w}^{c} \right)}\left[ \log \left( \frac{1}{\left(2 \pi \sigma^{2}\right)^{d/2}} \right) -\frac{\left\|\mathbf{z}_{w}-g_{r}\left(\mathbf{z}^{p}_{w^{\prime}} \odot \mathbf{z}^{c}_{w}\right)\right\|_{2}^{2}}{2 \sigma^{2}} \right].
\end{aligned}
\label{eq: disen_upper}
\end{equation}
Hence, we obtain the upper bound of $H\left(\mathbf{z}_{w} \mid \mathbf{z}_{w^{\prime}}^{p}, \mathbf{z}_{w}^{c}\right)$ in Equation \ref{eq: disen_upper}. Since the entropy itself is intractable, we instead minimize its upper bound; neglecting the constant terms yields the following objective function,
\begin{equation}
\min \mathbb{E}_{p\left(  \mathbf{z}_{w}, \mathbf{z}_{w^{\prime}}^{p}, \mathbf{z}_{w}^{c} \right)}\left\|\mathbf{z}_{w}-g_{r}\left(\mathbf{z}^{p}_{w^{\prime}} \odot \mathbf{z}^{c}_{w} \right)\right\|_{2}^{2} .
\end{equation}
Since we adopt two augmentation views and propose the cross-view reconstruction mechanism in our method, we can minimize the entropy by minimizing $\mathcal{L}_{\text{recon}}$ and thus guarantee the disentanglement of $\mathbf{z}^{p}$ and $\mathbf{z}^{c}$. 
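
To make the neglected constants explicit, the bound in Equation \ref{eq: disen_upper} can be written in closed form. The following is only a sketch, in which $d$ denotes the dimension of $\mathbf{z}_{w}$ and $\sigma^{2}$ is treated as a fixed constant (both are assumptions of this illustration):
\begin{equation}
H\left(\mathbf{z}_{w} \mid \mathbf{z}_{w^{\prime}}^{p}, \mathbf{z}_{w}^{c}\right) \leq \frac{1}{2\sigma^{2}} \, \mathbb{E}_{p\left(  \mathbf{z}_{w}, \mathbf{z}_{w^{\prime}}^{p}, \mathbf{z}_{w}^{c} \right)} \left\|\mathbf{z}_{w}-g_{r}\left(\mathbf{z}^{p}_{w^{\prime}} \odot \mathbf{z}^{c}_{w} \right)\right\|_{2}^{2} + \frac{d}{2} \log \left( 2 \pi \sigma^{2} \right) .
\nonumber
\end{equation}
For a fixed $\sigma^{2}$, the second term does not depend on the model parameters, and the factor $\frac{1}{2\sigma^{2}}$ only rescales the reconstruction term, so it can be absorbed into the coefficient $\lambda_{r}$. For instance, choosing $\sigma^{2}=\frac{1}{2}$ gives a unit factor and leaves the squared error of Theorem \ref{theorem: disentangle} plus the additive constant $\frac{d}{2}\log \pi$.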


% \section{Effects of Representation Disentanglement}
% \label{sec: exp_disentangle}
% \begin{figure}[h]
% % %%\vspace{-0.3in}
%   \centering
%   \includegraphics[width=\linewidth]{CL.png}
%   \caption{InfoNCE loss of the two disentangled representations between the two augmentation graph views, where orange lines are the InfoNCE loss between the two non-predictive representations and blue lines are the InfoNCE loss between the two predictive representations} 
%   %%\vspace{-0.2in}
%   \label{fig: CL_Loss}
% \end{figure}

% In this section, we set experiments to investigate the representation disentanglement of our proposed GCVR. Specifically, we use the InfoNCE loss \citep{InfoNCE} to dynamically measure the representation difference between the two augmentation graph views based on the two disentangled representations, where blue lines indicate the InfoNCE loss between $\mathbf{z}^{p}_{1}$ and $\mathbf{z}^{p}_{2}$ and orange lines represent the InfoNCE loss between $\mathbf{z}^{c}_{1}$ and $\mathbf{z}^{c}_{2}$. For simplicity, we only demonstrate the first 100 pre-training epochs of PROTEINS and COLLAB in Figure \ref{fig: CL_Loss}, we can observe similar phenomena on other datasets. From the loss curves in Figure \ref{fig: CL_Loss} we can find that contrastive loss between predictive representations gradually decreases, indicating the predictive representation is optimized to capture all the shared information between the two augmentation views. Meanwhile, we can see contrastive loss between the non-predictive representations achieve a noticeable increase, which is consistent with our expectation that the two independent sampled augmentation operators cause a distribution shift between the two augmentation views.  To further investigate whether the feature suppression problem is equally serious in $\mathbf{z}^{p}$ and $\mathbf{z}^{c}$, we conduct experiments to compare the performance of the two representations on downstream tasks. The comparison results are as follows:


% % \begingroup
% \begin{table*}[h]
% \centering
% \caption{Performance comparison of the two learned representations. Results are reported as mean±std\%, the best performance is bolded.}
% \label{tab: disen}
% % \setlength{\tabcolsep}{3pt}
% % \begin{adjustbox}{width=\textwidth,center}
% \begin{tabular}{ccccccccccc} 
% \toprule
%   & \textbf{MUTAG} & \textbf{COLLAB} & \textbf{NCI1} & \textbf{PROTEINS}  & \textbf{IMDB-B}   & \textbf{RDT-B}  & \textbf{DD} & \textbf{ogbg-molhiv}\\
% \midrule
% $\mathbf{z}^{c}$ &88.1±1.2    & 75.1±0.7&	72.2±2.0&	73.5±0.8&	71.8±0.9&	89.4±1.0&	75.8±0.6&	69.70±2.8\\
% $\mathbf{z}^{p}$&	\textbf{92.3±0.7}&	\textbf{80.5±0.5}&	\textbf{82.0±1.0}&	\textbf{76.8±0.4}&	\textbf{75.6±0.4}&	\textbf{92.5±0.9}&	\textbf{80.5±0.5} &\textbf{75.36±1.4}\\
% \bottomrule
% \end{tabular}
% % \end{adjustbox}
% \end{table*}
% % \endgroup

% It is easy to observe that there is an obvious performance gap between the two learned representations, indicating the different feature suppression issues between them and the features subset that are more robust to augmentation is more informative and transferable than those sensitive to augmentations. Therefore, we believe our proposed GCVR can further alleviate the feature suppression issue with the disentanglement design.  


\section{Impacts of Reconstruction Loss}
\label{appendix: rec_loss}
In this work, we also conduct experiments to compare different loss functions for reconstruction. Besides the mean squared error (MSE) loss in Equation \ref{eq: disentanglement}, we include the scaled cosine error (SCE) loss used in GraphMAE. Previous work~\citep{rec_loss} has shown that the MSE loss is sensitive to data scale, meaning its effectiveness can vary significantly with the range of target values. In contrast, the SCE loss is scale-insensitive, making it particularly effective when the direction of vectors is crucial. Consequently, the MSE loss is more suitable for regression problems where magnitude matters, while a cosine-based loss usually handles classification tasks better since it mainly focuses on the angle between vectors. This suggests a potential opportunity to improve our model on classification tasks by substituting the MSE loss with the SCE loss. To investigate this, we test both losses on four OGB datasets~\citep{OGB}, of which two (ogbg-molbbbp and ogbg-moltox21) are classification tasks and the other two (ogbg-molesol and ogbg-molfreesolv) are regression tasks. The results in Table \ref{tab: rec_loss} support the conclusion drawn in previous work: SCE performs better on the two classification datasets, while MSE performs better on the two regression datasets. We therefore believe the choice between the MSE loss and the SCE loss depends on the requirements of the task and the inherent properties of the data.
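
For reference, the two candidate reconstruction losses can be written side by side. The sketch below adapts the SCE definition of GraphMAE to our graph-level setting; the shorthand $\hat{\mathbf{z}}_{w}=g_{r}\left(\mathbf{z}^{p}_{w^{\prime}} \odot \mathbf{z}^{c}_{w}\right)$ and the scaling exponent $\gamma \geq 1$ are notation introduced here for illustration, and the value of $\gamma$ is not specified by our experiments above.
\begin{equation}
\mathcal{L}_{\text{MSE}} = \left\|\mathbf{z}_{w}-\hat{\mathbf{z}}_{w}\right\|_{2}^{2}, \qquad
\mathcal{L}_{\text{SCE}} = \left(1-\frac{\mathbf{z}_{w}^{\top} \hat{\mathbf{z}}_{w}}{\left\|\mathbf{z}_{w}\right\|_{2} \left\|\hat{\mathbf{z}}_{w}\right\|_{2}}\right)^{\gamma} .
\nonumber
\end{equation}
Rescaling either vector by a positive constant leaves $\mathcal{L}_{\text{SCE}}$ unchanged but changes $\mathcal{L}_{\text{MSE}}$, which is the scale sensitivity discussed above.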

\begin{table}[h]
\centering
\caption{Impacts of different reconstruction loss computation methods.}
\vspace{-0.15in}
\label{tab: rec_loss}
% \begin{adjustbox}{width=\linewidth,center}
\begin{tabular}{ccccc} 
\toprule
  & \textbf{ogbg-molbbbp} & \textbf{ogbg-moltox21} & \textbf{ogbg-molesol} & \textbf{ogbg-molfreesolv} \\
\midrule
\textbf{GCVR-MSE} & 70.1±0.8 &	73.3±0.5 &	\textbf{1.112±0.040} &	\textbf{4.032±0.575} \\
\textbf{GCVR-SCE}&	\textbf{70.8±1.2}&	\textbf{74.2±0.5}&	1.225±0.076 &	4.520±0.680 \\
\bottomrule
\end{tabular}
% \end{adjustbox}
\end{table}



\section{Hyper-parameter Sensitivity}
\label{appendix: hyper}
In this section, we study the impacts of some important hyper-parameters in our method, including reconstruction loss coefficient $\lambda_{r}$, adversarial loss coefficient $\lambda_{a}$, embedding dimension $d$, batch size $|\mathcal{B}|$ and number of GNN layers $L$. Here, we select four datasets, i.e., MUTAG, PROTEINS, RDT-B, and COLLAB, to report for simplicity because the four datasets cover different domains and scales. We illustrate the impacts of these hyper-parameters in the figures below. 

\begin{figure*}[ht!]
  \centering
  \includegraphics[width=\linewidth]{lambda_r.pdf}
    \vspace{-0.3in}
  \caption{Impact of reconstruction loss coefficient $\lambda_{r}$ on different datasets. The dashed line marks the non-reconstruction setting ($\lambda_{r}=0$) for comparison.} 
  \label{fig: lamr}
\end{figure*}

From the results in Figure \ref{fig: lamr}, we can see that
the optimal reconstruction loss coefficient $\lambda_{r}$ depends on the specific dataset, but all the nonzero values in our experiment improve the performance over the non-reconstruction variant, i.e., $\lambda_{r}=0$, indicating the effectiveness of our proposed cross-view reconstruction mechanism. 

\begin{figure*}[ht!]
  \centering
  \includegraphics[width=\linewidth]{lambda_a.pdf}
    \vspace{-0.3in}
  \caption{Impact of adversarial loss coefficient $\lambda_{a}$ on different datasets. The dashed line marks the non-adversarial setting ($\lambda_{a}=0$) for comparison.}
  \label{fig: lama}
\end{figure*}

\begin{figure*}[h!]
  \centering
  \includegraphics[width=\linewidth]{para_com.pdf}
    \vspace{-0.3in}
  \caption{Impact of embedding dimension $d$ and GNN layer number $L$ on different datasets.} 
  \label{fig: emb_lay}
\end{figure*}

Figure \ref{fig: lama} shows that adversarial training can further raise the model performance, which suggests that a robust representation with less redundant information usually achieves a larger performance gain than a brittle one. During this process, we need to choose an appropriate adversarial loss coefficient $\lambda_{a}$, since an overly large $\lambda_{a}$ may hurt the information sufficiency of the learned representation. 
We discuss the impacts of embedding dimension $d$ and GNN layer number $L$ together because their experimental results show a similar trend. From Figure \ref{fig: emb_lay}, we observe that the optimal values of the two hyper-parameters generally increase as the dataset scale increases. A possible reason is that large datasets usually contain more latent factors than small datasets, and therefore a model with a larger capacity is needed to fit them. However, such a high-capacity message-passing model can degrade the performance on a small dataset because the learned representation may become over-smoothed and hence less informative. 



\end{document}
