%\documentclass{article}
\documentclass[accepted]{uai2022}

%\pdfpagewidth=8.5in
%\pdfpageheight=11in

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}

\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

\usepackage{times}
\renewcommand*\ttdefault{txtt}
\usepackage{amsmath}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsthm}
\usepackage{graphicx}
\usepackage{algpseudocode}
\usepackage{algorithm}
\usepackage{subcaption,graphicx}
\usepackage{url}
%\usepackage[hidelinks]{hyperref}
\usepackage[utf8]{inputenc}
%\urlstyle{same}
\usepackage{comment}
%\usepackage{hyperref}
\usepackage{xcolor}
\usepackage{mathrsfs}
\usepackage{caption}
\urlstyle{same}
\DeclareCaptionFont{9pt}{\fontsize{9pt}{10pt}\selectfont}
\captionsetup{font={9pt}}

\usepackage{listings}
\lstset{basicstyle=\small\ttfamily,columns=fullflexible}

\renewcommand{\footnotesize}{\fontsize{9pt}{11pt}\selectfont}


\newcommand{\mvec}{\operatorname{vec}\diagfences}

% Francesco start
\newcommand{\blambda}{\boldsymbol{\lambda}}
\newcommand{\bv}{\boldsymbol{v}}
\usepackage{soul} %used for \st{text}
\usepackage{bm}
% Francesco end



\usepackage{booktabs}
\newcommand{\ra}[1]{\renewcommand{\arraystretch}{#1}}

\definecolor{myyellow}{rgb}{0.9290 0.6940 0.1250}
\definecolor{Gray}{gray}{0.9}

\newif\ifcomments % flag used by \commentR; false by default
\newcommand{\commentR}[1]{\ifcomments \textbf{\textcolor{magenta}{RJ: #1}} \fi}

\title{Principle of Relevant Information for Graph Sparsification}

\author[1]{\href{mailto:yusj9011@gmail.com?Subject=Your UAI 2022 paper}{Shujian~Yu}{}}
\author[2]{\href{mailto:francesco.alesiani@neclab.eu?Subject=Your UAI 2022 paper}{Francesco Alesiani}{}}
\author[3]{Wenzhe~Yin}
\author[1,5,6]{Robert~Jenssen}
\author[4]{Jose~C.~Principe}
% Add affiliations after the authors
\affil[1]{%
    %Machine Learning Group\\
    UiT - The Arctic University of Norway\\
    Norway
}
\affil[2]{%
    %dd\\
    NEC Laboratories Europe\\
    Germany
}
\affil[3]{%
    %Informatics Institute\\
    University of Amsterdam\\
    Netherlands
  }
\affil[4]{%
    %Department of Electrical and Computer Engineering\\
    University of Florida\\
    USA
  }
\affil[5]{%
    Norwegian Computing Center\\
    Norway
  }
\affil[6]{%
    University of Copenhagen\\
    Denmark
  }

\usepackage{mathtools}
\DeclareMathOperator{\Mat}{Mat}
\DeclareMathOperator{\diag}{diag}
\DeclarePairedDelimiter{\diagfences}{(}{)}
\newtheorem{theorem}{Theorem}
%\newtheorem{corollary}[theorem]{Corollary}
\newtheorem{definition}[theorem]{Definition}
%\newtheorem{property}[theorem]{Property}
\newtheorem{conjecture}[theorem]{Conjecture}
%\newcommand{\diag}{\operatorname{diag}\diagfences}
\newcommand{\tr}{\operatorname{tr}\diagfences}
\newcommand{\Be}{\operatorname{Bernoulli}\diagfences}
\renewcommand{\algorithmicrequire}{\textbf{Input:}}
\renewcommand{\algorithmicensure}{\textbf{Output:}}
\DeclareMathOperator{\E}{\mathbb{E}}
\DeclareMathOperator{\Tr}{tr}
\DeclareMathOperator{\range}{range}
%\DeclareMathOperator*{\min}{min} 
\DeclareMathOperator*{\argmin}{argmin} % thin space, limits underneath in displays
\DeclareMathOperator{\Sigmoid}{Sigmoid}

\newtheorem{property}{Property}[section]
\newtheorem{corollary}{Corollary}
\newtheorem{assumption}{Assumption}
\newtheorem{lemma}{Lemma}

%\usepackage[ruled, noend, noline]{algorithm2e}
%\SetKwInOut{Parameter}{parameter}
\usepackage{xcolor}

\definecolor{codegreen}{rgb}{0,0.6,0}
\definecolor{codegray}{rgb}{0.5,0.5,0.5}
\definecolor{codepurple}{rgb}{0.58,0,0.82}
\definecolor{backcolour}{rgb}{0.95,0.95,0.92}

\lstdefinestyle{mystyle}{
    backgroundcolor=\color{backcolour},   
    commentstyle=\color{codegreen},
    keywordstyle=\color{magenta},
    numberstyle=\tiny\color{codegray},
    stringstyle=\color{codepurple},
    basicstyle=\ttfamily\footnotesize,
    breakatwhitespace=false,         
    breaklines=true,                 
    captionpos=b,                    
    keepspaces=true,                 
    numbers=left,                    
    numbersep=5pt,                  
    showspaces=false,                
    showstringspaces=false,
    showtabs=false,                  
    tabsize=2
}

\lstset{style=mystyle}

% Francesco (start)
\usepackage{soul} %used for \st{text}
% https://tex.stackexchange.com/questions/65453/track-changes-in-latex
%\usepackage{changes} % used to track changes
% Francesco (end)

%\setlength {\marginparwidth }{2cm}

\begin{document}
	
\maketitle

\begin{abstract}
%We present a principled information-theoretic approach for graph sparsification, i.e., reducing edges such that the overall complexity of the graph is lowered while certain essential properties of the graph is still maintained. Our objective is inspired by the Principle of Relevant Information (PRI), 

%Graph sparsification aims to reduce the number of edges of a graph while maintaining its certain structural properties. In this paper, we propose the first general and effective information-theoretic formulation of graph sparsification, by taking inspiration from the Principle of Relevant Information (PRI). To this end, we extend the PRI from a standard $i.i.d.$ random variable setting to structured data (i.e., graphs). Our Graph-PRI objective is achieved by operating on the graph Laplacian spectrum, made possible by expressing the graph Laplacian of a subgraph in terms of a sparse edge selection vector $\mathbf{w}$. We provide both theoretical and empirical justifications on the validity of our Graph-PRI. We also analyze its analytical solutions in a few special cases. We finally present three real-world applications, namely graph visualization, graph classification and graph-regularized multi-task learning, to demonstrate the versatility and advantages of our approach over prevalent graph sparsifiers.

% Graph sparsification aims to reduce the number of edges of a graph while maintaining its structural properties. In this paper, we propose the first general and effective information-theoretic formulation of graph sparsification, by taking inspiration from the Principle of Relevant Information (PRI). To this end, we extend the PRI from a standard scalar random variable {\color{blue} \st{setting} } to structured data (i.e., graphs). 
% \replaced{Our {\it Graph-PRI} objective operates over graph Laplacian, while graph Laplacian of the subgraph is defined in terms of a sparse edge selection vector $\mathbf{w}$.
% }{Our Graph-PRI objective is achieved by operating on the graph Laplacian, made possible by expressing the graph Laplacian of a subgraph in terms of a sparse edge selection vector $\mathbf{w}$.}
% We provide both theoretical and empirical justifications on the validity of our Graph-PRI. We also analyze its analytical solutions in a few special cases. We finally present three representative real-world applications, namely graph sparsification, graph regularized multi-task learning, and medical imaging-derived brain network classification, to demonstrate the effectiveness, the versatility and the enhanced interpretability of our approach over prevalent sparsification techniques. 


Graph sparsification aims to reduce the number of edges of a graph while maintaining its structural properties. In this paper, we propose the first general and effective information-theoretic formulation of graph sparsification, by taking inspiration from the Principle of Relevant Information (PRI). To this end, we extend the PRI from a standard scalar random variable setting to structured data (i.e., graphs). 
Our Graph-PRI objective is achieved by operating on the graph Laplacian, made possible by expressing the graph Laplacian of a subgraph in terms of a sparse edge selection vector $\mathbf{w}$.
We provide both theoretical and empirical justifications for the validity of our Graph-PRI approach. We also analyze its analytical solutions in a few special cases. We finally present three representative real-world applications, namely graph sparsification, graph-regularized multi-task learning, and medical imaging-derived brain network classification, to demonstrate the effectiveness, the versatility and the enhanced interpretability of our approach over prevalent sparsification techniques. Code for Graph-PRI is available at \url{https://github.com/SJYuCNEL/PRI-Graphs}.


%\commentR{Link to Github?}

% We present two ways to optimize $\mathbf{w}$ by gradient-based method. 
\end{abstract}

\graphicspath{{figures/}}

\section{Introduction}
%Graphs and networks are used in numerous machine learning and data mining applications to describe relationships between entities, such as social relations between persons, links between web pages, flow of traffic, or interactions between proteins. 
Many complex structures and phenomena are naturally described as graphs and networks (e.g., social networks, brain functional connectivity~\citep{zhou2020toolbox}, climate causal effect networks~\citep{nowack2020causal}, etc.). However, it is challenging to visualize and analyze a graph of even moderate size, because the number of edges can grow quadratically with the number of nodes. Therefore, techniques that sparsify graphs by pruning less informative edges have gained increasing attention in the last two decades~\citep{spielman2011graph,bravo2019unifying,wu2020graph}.
Apart from making visualization much easier, graph sparsification can be used in multiple ways. For example, it can reduce storage requirements and shorten the running time of machine learning algorithms involving graph regularization, with negligible accuracy loss~\citep{sadhanala2016graph}. When differential privacy is a major concern, sparsification can remove or hide edges for the purpose of information protection~\citep{arora2019differentially}.

%  traffic flow over different sensors~\citep{chen2021trafficstream}

On the other hand, there is a recent trend of applying information-theoretic concepts and principles to problems related to graphs or graph neural networks.
Let $\mathcal{X}$ denote the graph input data, which may encode both graph structure information (characterized by either the adjacency matrix $A$ or the graph Laplacian $L$) and node attributes, and let $Y$ denote the desired response, such as node labels or graph labels.
A notable example is the famed Information Bottleneck (IB) approach~\citep{tishby99information}, which formulates learning as:
\begin{equation}\label{eq:IB_Lagrangian}
    \mathcal{L}_{\text{IB}}=\min I(\mathcal{X};T) - \beta I(Y;T),
\end{equation}
in which $I(\cdot;\cdot)$ denotes the mutual information.
$T$ is the object we want to learn or infer from $\{\mathcal{X},Y\}$,
{\color{black} which can serve as a graph node representation~\citep{wu2020graphIB} or as the most informative and interpretable subgraph with respect to the label $Y$~\citep{yu2020graph}. }
% or an interpretable subgraph that is most informative to label $Y$~\citep{yu2020graph}. 
$\beta$ is a Lagrange multiplier that controls the trade-off between \textbf{sufficiency} (the performance of $T$ on the downstream task, as quantified by $I(Y;T)$) and \textbf{minimality} (the complexity of the representation, as measured by $I(\mathcal{X};T)$).

%which has been used for graph node representation learning~\cite{wu2020graphIB} or identifying the most interpretable subgraph with respect to a certain decision (e.g., graph labels)~\cite{yu2020graph}, 

Instead of using the IB approach, we explore the feasibility and power, on graph data, of another, less well-known information-theoretic principle:
the Principle of Relevant Information (PRI)~\cite[Chapter~8]{principe2010information}, which exploits self-organization and requires only a single random variable $\mathcal{X}$. Unlike IB, which requires an auxiliary relevant variable $Y$ and possibly the joint distribution $\mathbb{P}(\mathcal{X},Y)$, the PRI is fully unsupervised and aims to obtain a reduced statistical representation $T$ by decomposing $\mathcal{X}$ with:
\begin{equation}\label{eq_PRI_obj}
    \mathcal{L}_{\text{PRI}} = \min H(T) + \beta D(\mathbb{P}(\mathcal{X})\|\mathbb{P}(T)),
\end{equation}
where $H(T)$ refers to the entropy of $T$. The minimization of entropy can be viewed as a means of reducing uncertainty and finding the statistical \textbf{regularity} in $T$. $D(\mathbb{P}(\mathcal{X})\|\mathbb{P}(T))$ is the divergence between the distributions of $\mathcal{X}$ (i.e., $\mathbb{P}(\mathcal{X})$) and $T$ (i.e., $\mathbb{P}(T)$), which quantifies the \textbf{descriptive power} of $T$ about $\mathcal{X}$.



%On the other hand, as a re-emerged information-theoretic principle, the principle of relevant information (PRI)~\cite[Chapter~8]{principe2010information} has been recently applied to various machine learning applications, including training an autoencoder for reliable transmission of signals over noisy channels~\cite{yu2017autoencoders} and extracting spatial-temporal features for hyperspectral image classification~\cite{wei2019multiscale}.

%PRI was initiated in 2008 as a mode decomposition framework. Similar to many other existing information-theoretic principles like the famed Information Bottleneck (IB) approach~\cite{tishby99information}, the utility of PRI was severely limited by its probabilistic nature over independent and identically distributed ($i.i.d.$) samples, which cannot be directly applied to structure data like graphs.

%In this work, we present an initial investigation on extending PRI to graphs. To this end, instead of estimating probability distributions on $i.i.d.$ samples as a prerequisite, the novel formulation of PRI is defined on the cone of symmetric positive semidefinite (SPS) matrices. We then show that the novel matrix-based PRI enables directly application to graph, because the graph Laplacian is SPS and essentially describes a graph.


%Different from these earlier efforts, we present a principled way to sparsify a graph from an information-theoretic perspective.

% \textcolor{blue}{
%To the best of knowledge, the \emph{uncertain graph sparsification} (UGS)~\cite{parchas2018uncertain} is the only methodology that also attempts to bridge the gap between information theory and graph sparsification. However, UGS just assigns a probability over the existence of edges and simply applies the basic Shannon discrete entropy to provide a high-level gauge on the information content of a given graph. By contrast, our methodology operates on the eigenspace of graph Laplacian and avoids probability estimation.
% }

%concentrates on the graph Laplacian which essentially describes a graph. Moreover, different from most of existing information-theoretic learning methodologies (e.g., [?]) that relies heavily on probability distribution estimation, our methodology avoids probability space (and probability estimation) and is operated directly on the eigenspace of valid positive semidefinite matrices.}


So far, PRI has only been used in a standard scalar random variable setting. Recent applications of PRI include, but are not limited to,
selecting the most relevant examples from the majority class in imbalanced classification~\citep{hoyos2021relevant},
and learning disentangled representations with variational autoencoders~\citep{li2020pri}.
Usually, one uses the second-order R{\'e}nyi entropy~\citep{renyi1961measures} to quantify $H(T)$ and the Cauchy-Schwarz (CS) divergence~\citep{jenssen2006cauchy} to quantify $D(\mathbb{P}(\mathcal{X})\|\mathbb{P}(T))$ for ease of optimization.

% extracting spectral-spatial features from hyperspectral images~\citep{wei2021multiscale}, 

In this paper, we extend PRI to graph data. This is not a trivial task, as R{\'e}nyi's quadratic entropy and the CS divergence are defined over probability spaces and do not capture any structural information. We also exemplify our Graph-PRI with an application to graph sparsification.
To summarize, our contributions are fourfold:
\begin{itemize}
    \item Taking the graph Laplacian as the input variable, we propose a new information-theoretic formulation for graph sparsification, by taking inspiration from PRI.
    \item We provide theoretical and empirical justifications for the objective of Graph-PRI for sparsification. We also analyze the analytical solutions for some special values of the hyperparameter $\beta$.
    \item We demonstrate that the graph Laplacian of the resulting subgraph can be elegantly expressed in terms of a sparse edge selection vector $\mathbf{w}$, which significantly simplifies the learning argument of Graph-PRI. We also show that the objective of Graph-PRI is differentiable, which further simplifies the optimization.
    %\item We show that the objective of Graph-PRI is differentiable with respect to graph Laplacian. Using the reparameterization trick, the optimization can be further simplified. 
    \item Experimental results on graph sparsification, graph-regularized multi-task learning, and brain network classification demonstrate the versatility and compelling performance of Graph-PRI.
\end{itemize}


\section{Preliminary Knowledge}
%Before presenting our statistic, we first briefly introduce background knowledge regarding the Bregman matrix divergence and the correntropy function.

\subsection{Problem Definition and Notations}\label{sec:problem}
Consider an undirected graph $G =(V,E)$ with a set of nodes $V =\{v_1,\cdots,v_N\}$ and a set of edges $E = \{e_1,\cdots,e_M\}$ that encodes the connections between nodes. The objective of graph sparsification is to preferentially retain a small subset of edges from $G$ to obtain a sparsified surrogate graph $G_s=(V,E_s)$ with the edge set $E_s \subset E$ such that $|E_s|\ll M$~\citep{spielman2011graph,hamann2016structure}.

\textcolor{black}{As an alternative to graph sparsification, it is also possible to reduce the number of nodes of the graph, which is called graph coarsening. Recent examples of graph coarsening include~\citep{loukas2019graph,cai2021graph}. Recently, \citet{bravo2019unifying} provided a unified framework for graph sparsification and graph coarsening. In this work, we focus only on graph sparsification.}

% $E \subseteq V\times V$

% $G =(V,E)$ by $L\in \mathcal{L}$, where $\mathcal{L}$ is the space of all graph Laplacian: $\mathcal{L}=\left\{L\in S_+^T:\ (\forall i\neq j)\ L_{ij}\le0,\ L_{ii}=-\sum_{j\neq i}L_{ij}\right\}$. 

The topology of $G$ is essentially determined by its graph Laplacian $L=D-A$, where $A$ is the adjacency matrix and $D = \diag{(\mathbf{d})}$ is the diagonal matrix formed by the degrees of the vertices $d_i = \sum_{j=1}^N A_{ij}$.
Given an \underline{arbitrary} orientation of the edges of $G$,
the incidence matrix $B=[\mathbf{b}_1,\cdots,\mathbf{b}_M]$ of $G$ is an $N\times M$ matrix whose entries are given by:
\begin{equation}
    [\mathbf{b}_m]_i = \begin{cases}
 +1 &\text{if node $v_i$ is the head of edge $e_m$}
 \\ -1 &\text{if node $v_i$ is the tail of edge $e_m$}
 \\ 0 &\text{otherwise}
\end{cases}.
\end{equation}

Mathematically, $L$ can be expressed in terms of $B$ as:
\begin{equation}
    L = BB^T=\sum_{m=1}^M \mathbf{b}_m \mathbf{b}_m^T.
\end{equation}
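
As a small illustrative example (the graph and the edge orientation are chosen here only for illustration), take the triangle graph with $N=3$ nodes and $M=3$ edges $e_1=(v_1,v_2)$, $e_2=(v_2,v_3)$, $e_3=(v_1,v_3)$, and orient each edge so that its lower-indexed node is the head. Then
\begin{equation*}
B=\begin{pmatrix} 1 & 0 & 1\\ -1 & 1 & 0\\ 0 & -1 & -1 \end{pmatrix},
\qquad
BB^T=\begin{pmatrix} 2 & -1 & -1\\ -1 & 2 & -1\\ -1 & -1 & 2 \end{pmatrix}=D-A,
\end{equation*}
with $D=\diag{(2,2,2)}$. Reversing the orientation of any edge flips the sign of the corresponding column $\mathbf{b}_m$ but leaves $\mathbf{b}_m\mathbf{b}_m^T$, and hence $L$, unchanged.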

% $L$ can be expressed in terms of the so-called incidence matrix $M=[\mathbf{a}_1,\cdots,\mathbf{a}_M]\in \mathbb{R}^{N\times M}$ as:
% \begin{equation}
% L=MM^T=\sum_{m=1}^M \mathbf{a}_m \mathbf{a}_m^T,
% \end{equation}
% where $\mathbf{a}_m$ denotes a length-$N$ edge vector for the $m$-th edge that connects node $i$ with node $j$, with entries $[\mathbf{a}_m]_i=1$, $[\mathbf{a}_m]_j=-1$ and zeros elsewhere.

%$L$ can be expressed in terms of $B$

%Let us now denote the subgraph $G_s(V,E_s)$ with the edge set $E_s\in E$ such that $|E_s|=K\ll M$. 
Suppose the subgraph $G_s$ contains $K$ edges. One can then obtain $G_s$ from $G$ through an edge selection vector $\mathbf{w}=[w_1,\cdots,w_M]^T\in \{0,1\}^M$. Here, $\|\mathbf{w}\|_0=K$, $w_m=1$ if the $m$-th edge belongs to the edge subset $E_s$, and $w_m=0$ otherwise. Finally, one can write the graph Laplacian $L_s$ of the subgraph $G_s$ as a function of $\mathbf{w}$:
\begin{equation}\label{eq:k-sparse}
L_s(\mathbf{w}) = \sum_{m=1}^M w_m \mathbf{b}_m \mathbf{b}_m^T = B\diag(\mathbf{w})B^T,
\end{equation}
in which $\diag{(\mathbf{w})}\in \mathbb{R}^{M\times M}$ is a square diagonal matrix with $\mathbf{w}$ on the main diagonal.
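
For instance, under the illustrative assumption that $G$ is the triangle on nodes $v_1,v_2,v_3$ with edges $e_1=(v_1,v_2)$, $e_2=(v_2,v_3)$ and $e_3=(v_1,v_3)$, the selection $\mathbf{w}=[1,1,0]^T$ (so that $K=2$) drops $e_3$ and gives
\begin{equation*}
L_s(\mathbf{w}) = \mathbf{b}_1\mathbf{b}_1^T+\mathbf{b}_2\mathbf{b}_2^T
=\begin{pmatrix} 1 & -1 & 0\\ -1 & 1 & 0\\ 0 & 0 & 0 \end{pmatrix}
+\begin{pmatrix} 0 & 0 & 0\\ 0 & 1 & -1\\ 0 & -1 & 1 \end{pmatrix}
=\begin{pmatrix} 1 & -1 & 0\\ -1 & 2 & -1\\ 0 & -1 & 1 \end{pmatrix},
\end{equation*}
which is the Laplacian of the path $v_1$--$v_2$--$v_3$. Each rank-one term $\mathbf{b}_m\mathbf{b}_m^T$ does not depend on the orientation chosen for $e_m$, so neither does $L_s(\mathbf{w})$.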

Note that Eq.~(\ref{eq:k-sparse}) also applies to a weighted graph $G=(V,W)$ by a proper reformulation of the incidence matrix $B$ as:
\begin{equation}
    [\mathbf{b}_m]_i = \begin{cases}
 +\sqrt{\mu_m} &\text{if node $v_i$ is the head of edge $e_m$}
 \\ -\sqrt{\mu_m} &\text{if node $v_i$ is the tail of edge $e_m$}
 \\ 0 &\text{otherwise}
\end{cases},
\end{equation}
in which $\mu_m$ is the weight of edge $e_m$.
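
To see why this reformulation suffices, consider (as a quick check, assuming a simple graph without parallel edges) an edge $e_m$ joining $v_i$ and $v_j$. The rank-one term $\mathbf{b}_m\mathbf{b}_m^T$ then has only four nonzero entries:
\begin{equation*}
[\mathbf{b}_m\mathbf{b}_m^T]_{ii}=[\mathbf{b}_m\mathbf{b}_m^T]_{jj}=\mu_m,
\qquad
[\mathbf{b}_m\mathbf{b}_m^T]_{ij}=[\mathbf{b}_m\mathbf{b}_m^T]_{ji}=-\mu_m.
\end{equation*}
Summing over all edges therefore places the weighted degree of each node on the diagonal and $-\mu_m$ at the off-diagonal positions of edge $e_m$, which is the weighted graph Laplacian. With $w_m\in\{0,1\}$, Eq.~(\ref{eq:k-sparse}) keeps or removes each weighted edge as a whole.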

In what follows, we will design a learning-based approach to optimally obtain the edge selection vector $\mathbf{w}$ by making use of the general idea of PRI.

%to sparse a graph (remove redundant edges) but reserve its certain essential properties.


%\subsection{Principle of Relevant Information}
%The principle of relevant information (PRI)~\cite[Chapter~8]{principe2010information} is an information-theoretic principle that aims to perform mode decomposition of a random variable $X$ with a known (and fixed) probability distribution $g$.

\begin{comment}
The PRI was originally proposed to decompose a random variable $X$ with a know and fixed probability distribution $\mathbb{P}(X)$. Suppose we aim to obtain another reduced statistical representation characterized by a random variable $T$ with probability distribution $\mathbb{Q}(T)$. The PRI casts this problem as a trade-off between the entropy of $T$ and its descriptive power about $X$ in terms of their divergence $D(\mathbb{P}(X)\|\mathbb{Q}(T))$:
\begin{equation}\label{eq_PRI_obj}
    \mathcal{L}_{\text{PRI}} = \min H(T) + \beta D(\mathbb{P}(\mathcal{X})\|\mathbb{Q}(T)).
\end{equation}
\end{comment}

% The PRI is similar in spirit to IB approach~\cite{tishby99information}. But the formulation is different here, because PRI does not require a relevant auxiliary variable $Y$ and the optimization is done directly on $X$ (rather than $\mathbb{P}(X,Y)$).

% where $\gamma$ is a hyper-parameter controlling the amount of relevant information that $Y$ can extract from $X$. The minimization of entropy can be viewed as a means of reducing uncertainty (or redundancy) and finding the statistical regularities in the outcomes of a process, whereas the minimization of information divergence ensures that such regularities are closely related to $X$. 

% \begin{equation}\label{eq_PRI_obj}
%     \mathcal{J}_{\text{PRI}} = \argmin\limits_{f}\mathbf{H}(f) + \gamma D(f\|g),
% \end{equation}

%Note that the choice of entropy and divergence is application-specific and depends mostly on the simplicity of optimization. 

\begin{comment}
So far, PRI has only been used for $i.i.d.$ data. Recent applications of PRI includes learning spectral-spatial features for hyperspectral image classification~\cite{wei2021multiscale} and inferring disentengled representations with variational autoencoder~\cite{li2020pri}. Usually, one uses the $2$-order R{\'e}nyi's quadratic entropy~\cite{renyi1961measures} to quantify $H(T)$ and the Cauchy-Schwarz (CS) divergence~\cite{jenssen2006cauchy} to quantify $D(\mathbb{P}(X)\|\mathbb{Q}(T))$. The resulting objective reduces\footnote{Details on CS divergence and PRI are in the supp. material.}:
\begin{equation}\label{eq:pri_renyi}
\mathcal{L}_{\text{PRI}} = \min (1-\beta)H_2(T) + 2\beta H_2(X;T),
\end{equation}
%\end{small}
where $H_2(T)=-\log \int q^2(x)dx$ and $H_2(X;T)=-\log \int p(x)q(x)dx$ (which is also called the quadratic cross entropy~\cite{principe2010information}).
\end{comment}

% \begin{equation}\label{eq:pri_renyi}
% \begin{aligned}
% J(f)&=\arg \min_f H_2(f)+\beta D_{CS}(f||g)\\
% & =\arg \min_f H_2(f)+\beta(2H_2(f;g)-H_2(f)-H_2(g))\\
% & \equiv \arg \min_f (1-\beta)H_2(f) + 2\beta H_2(f;g),
% \end{aligned}
% \end{equation}

%In what follows, we will show how to extend PRI from $i.i.d.$ data to graphs, which we term Graph-PRI. This is not a trivial task, as the R{\'e}nyi's quadratic entropy and CS divergence are defined over probability space and do not capture any structure information. Our Graph-PRI is then exemplified with application in graph sparsification, in which we will design a learning-based approach to optimally obtain the edge selection vector $\mathbf{w}$. 

% \footnote{The CS divergence is motivated by the famed CS inequality:
% \begin{equation}
% \Big| \int f(x)g(x)dx \Big|^2 \leq \int \mid f(x)\mid^2 dx \int \mid g(x)\mid^2 dx,
% \end{equation}
% with equality if and only if $f(x)$ and $g(x)$ are linear independent, a measure of the ``distance'' between the PDFs can therefore be defined as:
% \begin{equation} \label{1.2}
% D_{cs} (f\|g) = -\log(\int fg)^2 + \log(\int f^2) + \log(\int g^2).
% \end{equation}}


\begin{comment}
\begin{figure}[ht]
  \begin{subfigure}[]{0.15\textwidth}
    \includegraphics[width=\textwidth]{original_update}
    \caption{3d Gaussian}
    % \label{}
  \end{subfigure}
  %
  \begin{subfigure}[]{0.15\textwidth}
    \includegraphics[width=\textwidth]{PRI_0_update}
    \caption{$\beta=0$}
    % \label{}
  \end{subfigure}
  %
    \begin{subfigure}[]{0.15\textwidth}
    \includegraphics[width=\textwidth]{PRI_1_update}
    \caption{$\beta=1$}
    % \label{}
  \end{subfigure} \\
  %
    \begin{subfigure}[]{0.15\textwidth}
    \includegraphics[width=\textwidth]{PRI_2_update}
    \caption{$\beta=2$}
    % \label{}
  \end{subfigure}
  %
    \begin{subfigure}[]{0.15\textwidth}
    \includegraphics[width=\textwidth]{PRI_5_update}
    \caption{$\beta=5$}
    % \label{}
  \end{subfigure}
  %
  \begin{subfigure}[]{0.15\textwidth}
    \includegraphics[width=\textwidth]{PRI_100_update}
    \caption{$\beta=100$}
    % \label{}
  \end{subfigure}
  \caption{Illustration of the structures revealed by the PRI for (a) a 3d isotropic Gaussian. As the values of $\beta$ increase, the solution passes through (b) a single point, (c) mode, (d) principal curves, (e) principal surfaces, and in the extreme case of (f) $\gamma\rightarrow\infty$ we get back the data themselves as the solution.}
  %\vspace{-0.7em}
  \label{fig:pri_demo}
\end{figure}
\end{comment}



{\color{black}
\subsection{Graph Sparsification}

% We refer to interested readers to [1] for a comprehensive survey in this topic. 

% \commentR{$\sigma$ here is a constant, right? Later you denote a matrix by $\sigma$.}

%Substantial efforts have been made on graph sparsification in the last decades. In general, existing methods can be roughly divided into two categories~\citep{wu2020graph}: 1) graph property preserving sparsifiers; and 2) application-oriented sparsifiers. The first category sparsifies a graph by preserving its main properties, such as shortest path distances, graph cuts~\citep{benczur2015randomized}, or graph Laplacian. 

Substantial effort has been devoted to graph sparsification. Existing methods are mostly based on sampling~\cite{fung2019general,wickman2021sparrl}, in which the importance of edges can be evaluated by effective resistance~\citep{spielman2011graph,spielman2011spectral}, the degree of neighboring nodes~\citep{hamann2016structure}, or local similarity~\citep{satuluri2011local}. Among them, the most notable example is the spectrum-preserving approach, which generates a $\gamma$-spectral approximation $G_s$ to $G$ such that:
\begin{equation}\label{eq:spectral_property}
    \frac{1}{\gamma} \vec{x}^T L_s \vec{x} \leq \vec{x}^T L \vec{x} \leq \gamma \vec{x}^T L_s \vec{x} \quad \text{for all} \quad \vec{x}.
\end{equation}
Remarkably, Spielman \emph{et al.} also proved that every graph $G$ has a $(1+\epsilon)$-spectral approximation $G_s$ with nearly $\mathcal{O}(\frac{N}{\epsilon^2})$ edges.
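
One way to read Eq.~(\ref{eq:spectral_property}), offered here as an explanatory aside, is as a pair of matrix inequalities in the positive semi-definite order, combined with the standard expansion of the Laplacian quadratic form (with $\mu_m$ the weight of edge $e_m=(v_i,v_j)$, and $\mu_m=1$ for unweighted graphs):
\begin{equation*}
\frac{1}{\gamma} L_s \preceq L \preceq \gamma L_s,
\qquad
\vec{x}^T L \vec{x}=\sum_{e_m=(v_i,v_j)\in E}\mu_m\,(x_i-x_j)^2 .
\end{equation*}
Taking $\vec{x}$ to be the indicator vector of a node subset $S\subset V$ makes $\vec{x}^T L \vec{x}$ equal to the total weight of the edges crossing the cut $(S, V\setminus S)$, so a $\gamma$-spectral approximation in particular preserves every cut up to a factor of $\gamma$.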


%By contrast, the second category targets specific down-stream applications, such as community detection~\citep{satuluri2011local}, link prediction~\citep{chen2015ensemble}. Our method belongs to the first category. 


%  

%Remarkably, authors also proved that for every weighted graph $G$ and every $\epsilon$, there exists a weighted graph $G_s$ with at most $\frac{(N-1)}{\epsilon^2}$ edges which is an $\frac{1+\epsilon^2}{1-\epsilon^2}$ approximation to $G$. 




% and influence maximization~\citep{mathioudakis2011sparsification}


%the electrical equivalent of the graph and assumes the sampling probability over $\text{edge}(u;v)$ is proportional to the amount of current that flows through this edge when a unit voltage difference is applied to nodes $u$ and $v$. 

%In session \ref{sec:experiments}, we will also demonstrate the versatility of our approach in different applications.

%such as \cite{spielman2011graph,spielman2011spectral} and its recent extensions for big graphs~\cite{imre2020spectrum}. This approach generates the electrical equivalent of the graph and assumes the sampling probability over $\text{edge}(u;v)$ is proportional to the amount of current that flows through this edge when a unit voltage difference is applied to nodes $u$ and $v$.



% For example, the underlying principle behind L-Spar is to retain edges between nodes with similar neighbors for improved clustering performance. Our method belongs to the first category. In session \ref{sec:experiments}, we will also demonstrate the utilities of our approach to different down-stream applications.

%In recent years, learning-based algorithms have gained popularity in different machine learning tasks. However, the problem of graph sparsification is still less-investigated. The aforementioned GSGAN is mainly used for community detection. However, GSGAN could generate new edges that are not existed in the original graph, which may complicate the problem of graph sparsification in practice. On the other hand, SparRL~\citep{wickman2021sparrl} uses the deep reinforcement learning to sequentially prune edges from the original graph by preserving the subgraph modularity. Different from GSGAN and SparRL, we demonstrate, in this work, that a sparsified graph can be learned simply by a gradient-based method in a principled (information-theoretic) manner, avoiding the necessity of reinforcement learning or the tuning of a generative adversarial network (GAN)~\citep{goodfellow2014generative}.


\textcolor{black}{Learning-based approaches to graph sparsification (especially those using neural networks), in which an explicit learning objective is optimized directly, remain under-investigated.} GSGAN~\citep{wu2020graph} is designed mainly for community detection, whereas SparRL~\citep{wickman2021sparrl} uses deep reinforcement learning to sequentially prune edges while preserving the subgraph modularity. Different from GSGAN and SparRL, we demonstrate below that a sparsified graph can be learned simply by a gradient-based method in a principled (information-theoretic) manner, avoiding the need for reinforcement learning or the tuning of a generative adversarial network (GAN)~\citep{goodfellow2014generative}.

% \cite{wan2021graph} integrates graph sparsification with node classification and formulates both problems using meta-learning with a bi-level objective.

% meta-learning which usually involves the tuning of neural networks or even

}


\section{PRI for Graph Sparsification}


\subsection{The Learning Objective}
Suppose we are given a graph $G$ with a known and fixed topology that is characterized by its graph Laplacian $\rho$, from which we want to obtain a surrogate subgraph $G_s$ with graph Laplacian $\sigma$, by preferentially removing less informative (or redundant) edges in $G$. Motivated by the objective of PRI in Eq.~(\ref{eq_PRI_obj}), we can cast this problem as a trade-off between the entropy $S(\sigma)$ of $G_s$ and its descriptive power about $G$ in terms of their divergence (or dissimilarity) $D(\sigma\|\rho)$:
\begin{equation} \label{eq:pri_graph}
\mathcal{J}_{\text{Graph-PRI}} = \arg \min_{\sigma} S(\sigma) + \beta D(\sigma||\rho).
\end{equation}

% We propose the use of the von Neumann entropy on graph defined as
In this paper, we choose the von Neumann entropy of the trace-normalized graph Laplacian (i.e., $\tilde{\sigma} = \sigma/\tr{\sigma}$) to quantify the entropy of $G_s$. It is defined on the cone of symmetric positive semi-definite (SPS) matrices with trace $1$ as~\citep{nielsen2002quantum}:
\begin{equation} \label{eq:vn}
S_{\text{vN}}(\tilde{\sigma}) = -\tr{\tilde{\sigma} \log\tilde{\sigma}}=-\sum_{i} \left( \lambda_i \log{\lambda_i}  \right),
\end{equation}
in which $\log(\cdot)$ is the matrix logarithm, $\tr{\cdot}$ denotes the trace, and $\{\lambda_i\}$ are the eigenvalues of $\tilde{\sigma}$.
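
As a quick worked example of Eq.~(\ref{eq:vn}) (our own illustration, assuming an unweighted complete graph $K_N$ and the standard convention $0\log 0=0$ for the zero eigenvalue of the Laplacian): the Laplacian of $K_N$ has eigenvalue $0$ once and eigenvalue $N$ with multiplicity $N-1$, so its trace is $N(N-1)$ and the non-zero eigenvalues of $\tilde{\sigma}$ all equal $1/(N-1)$. Hence
\begin{equation*}
S_{\text{vN}}(\tilde{\sigma}) = -\sum_{i=2}^{N}\frac{1}{N-1}\log\frac{1}{N-1} = \log(N-1),
\end{equation*}
which is the largest value attainable by a graph Laplacian on $N$ nodes, since at most $N-1$ of its eigenvalues are non-zero.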

% \footnote{A density matrix is a positive semidefinite and symmetric matrix with trace $1$ in quantum physics. For graph, it can be a graph Laplacian or a diffusion operator, \textcolor{blue}{as will be described later}.}

% The Jessen-Shannon divergence based on entropy of the Eq.(\ref{eq:vn})
%With the von Neumann entropy to hand, the quantum Jensen-Shannon (JS) divergence between two graph Laplacian $\sigma$ and $\rho$ is given by:

We then use the quantum Jensen-Shannon (QJS) divergence between the two trace-normalized graph Laplacians $\tilde{\sigma}$ and $\tilde{\rho}$ to quantify the divergence between $G$ and $G_s$~\citep{lamberti2008metric}:
\begin{equation} \label{eq:jsd}
D_{\text{QJS}}(\tilde{\sigma}||\tilde{\rho}) = S_{\text{vN}}\left(\frac{\tilde{\sigma} + \tilde{\rho}}{2}\right) - \frac{1}{2} S_{\text{vN}}(\tilde{\sigma}) - \frac{1}{2} S_{\text{vN}}(\tilde{\rho}).
\end{equation}
% We adopt this definition to the divergence Von Neuman divergence defined

In this paper, we absorb a scaling constant $2$ into the expression for $D_{\text{QJS}}(\tilde{\sigma}||\tilde{\rho})$. The resulting objective, combining Eqs.~(\ref{eq:pri_graph})-(\ref{eq:jsd}), is given by:
% \begin{equation}
% \label{eq:pri_final}
% \begin{aligned}
\begin{align}
\label{eq:pri_final}
\mathcal{J}_{\text{Graph-PRI}} & = \arg \min S_{\text{vN}}(\tilde{\sigma}) + \beta D_{\text{QJS}}(\tilde{\sigma}||\tilde{\rho}) \\ \nonumber
& = \arg \min S_{\text{vN}}(\tilde{\sigma}) \\ \nonumber
& + \beta \left[ 2 S_{\text{vN}}\left(\frac{\tilde{\sigma} + \tilde{\rho}}{2}\right) - S_{\text{vN}}(\tilde{\sigma}) - S_{\text{vN}}(\tilde{\rho}) \right] \\ \nonumber  
& \equiv \arg \min (1-\beta)S_{\text{vN}}(\tilde{\sigma}) + 2\beta S_{\text{vN}}\left(\frac{\tilde{\sigma} + \tilde{\rho}}{2}\right).
\end{align}
% \end{aligned}
% \end{equation}
%The second equation holds for the fact that the extra term $\beta S_{\text{vN}}(\tilde{\rho})$ is a constant with respect to $\tilde{\sigma}$. 
In the last line of Eq.~(\ref{eq:pri_final}), we drop the term $-\beta S_{\text{vN}}(\tilde{\rho})$ because it is constant with respect to $\tilde{\sigma}$.
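
As a sanity check of this reduction (an illustrative remark, not an additional result), consider the candidate $\tilde{\sigma}=\tilde{\rho}$. Then $(\tilde{\sigma}+\tilde{\rho})/2=\tilde{\rho}$, and the last line of Eq.~(\ref{eq:pri_final}) evaluates to
\begin{equation*}
(1-\beta)S_{\text{vN}}(\tilde{\rho}) + 2\beta S_{\text{vN}}(\tilde{\rho}) = S_{\text{vN}}(\tilde{\rho}) + \beta S_{\text{vN}}(\tilde{\rho}).
\end{equation*}
Adding back the constant $-\beta S_{\text{vN}}(\tilde{\rho})$ leaves $S_{\text{vN}}(\tilde{\rho})$, i.e., the entropy term alone, in agreement with $D_{\text{QJS}}(\tilde{\rho}||\tilde{\rho})=0$.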

%Obviously, we can observe a strong resemblance to the classical setting of PRI under Renyi's entropy funcationl (i.e., Eq.~(\ref{eq:pri_renyi})).

% & = \arg \min_{\sigma} S(\sigma) + \beta (2S\left(\frac{\sigma + \rho}{2}\right) - S(\sigma) - S(\rho)) \\


% Note that the graph Laplacian is a singular matrix (there is always a zero eigenvalue), which means that there is no matrix exponential or logarithm. An equivalent expressions for $S(\sigma)$ that avoids matrix exponential or logarithm is given by:
% \begin{equation}\label{eq:VN_entropy}
% S(\sigma) = -\tr {\sigma\log{\sigma} - \sigma} = -\sum_{i}  \left( \lambda_i \log{\lambda_i} -\lambda_i  \right),
% \end{equation}
% in which $\{\lambda_i\}$ are the eigenvalues of $\sigma$.


%Note that, an alternative selection to $D(\sigma||\rho)$ is the von Neumann divergence defined as $D(\sigma||\rho) = \tr{\sigma \log_2\sigma - \sigma \log_2\rho - \sigma + \rho}$. ... the optimization is extremely difficult~\cite{fawzi2019semidefinite,girard2014convex}.

%Note: The choice of entropy functional and divergence is application-specific and depends mostly on the simplicity of optimization. In this proposal, we select the von Neumann entropy and divergence~\cite{muller2013quantum}, since it is exactly defined on density matrix and has close relationship to the recently proposed matrix-based entropy functional~\cite{yu2019multivariate}.

\subsection{Justification of the Objective of Graph-PRI}\label{sec:justification}

\begin{comment}
\begin{table}%[!hbpt]
    % \label{tab:connection}
	\centering
	\small
	\caption{Resemblance of PRI in eigenspace of graph Laplacian and in probability space of $i.i.d.$ data.}\label{Tab:pri_similar}
	\begin{tabular}{c|ll}
		\toprule
		& eigenspace (von Neumann entropy) & probability space ($2$-order R{\'e}nyi's entropy) \\
		\midrule
\text{objective} & $\arg \min_{\sigma} (1-\beta)S(\sigma) + 2\beta S\left(\frac{\sigma + \rho}{2}\right)$   &  $\arg \min_f (1-\beta)H_2(f) + 2\beta H_2(f;g)$ \\
$\beta=0$ & $K_{1,n-1}$ \text{graph}  & \text{single~point}  \\
$\beta=1$ & \text{\textcolor{black}{Wather-Filling solution (see Thrm.\ref{th:waterfilling})}}  & \text{mean~shift}  \\
$\beta=\infty$ & \text{raw~Laplacian~spectrum} $\rho$  & \text{raw~distribution} $g$   \\
		\bottomrule
	\end{tabular}
\end{table}
\end{comment}

%\subsubsection{Computation of von Neumann entropy and divergence}


\begin{comment}
and
\begin{eqnarray}\label{eq:VN_divergence}
D(\sigma||\rho) &=& \tr{\sigma \log_2\sigma - \sigma \log_2\rho - \sigma + \rho} \\
&=& \sum_{i}   \lambda_i \log{\lambda_i} - \sum_{i,j} (\mathbf{v}_i^T\mathbf{u}_j)^2 \lambda_i \log{\theta_j} - \sum_i (\lambda_i - \theta_i).
\end{eqnarray}
\end{comment}


%\subsubsection{Justifications on von Neumann Entropy Functional}

% on graph Laplacian (i.e., $S_{\text{vN}}(L)$) 

One may ask why we choose the von Neumann entropy in $\mathcal{J}_{\text{Graph-PRI}}$. In fact, the Laplacian spectrum contains rich information about the multi-scale structure of graphs~\citep{mohar1997some}. For example, it is well known that the second smallest eigenvalue $\lambda_2(L)$, also called the algebraic connectivity, is commonly regarded as a measure of how well-connected a graph is~\citep{ghosh2006growing}.

On the other hand, it is natural to use the QJS divergence to quantify the dissimilarity between the original graph and its sparsified version. The QJS divergence is symmetric, and its square root has recently been shown to satisfy the triangle inequality~\citep{virosztek2021metric}. As a graph dissimilarity measure, QJS has also found applications in multilayer network compression~\citep{de2015structural} and anomaly detection in graph streams~\citep{chen2019fast}.


A few recent studies indicate close connections between $S_{\text{vN}}(L)$ and the structural \emph{regularity} and \emph{sparsity} of a graph~\citep{passerini2008neumann,han2012graph,liu2021bridging,simmons2018neumann}. We now highlight three theorems from these works and explain our justifications in detail in Sections 3.2.1 and 3.2.2.

\begin{theorem}[\citep{passerini2008neumann}]\label{thm:add_edge}
Given an undirected graph $G=\{V,E\}$, let $G'=G+\{u,v\}$, where $V(G')=V(G)$ and $E(G')=E(G)\cup\{\{u,v\}\}$. Then we have:
\begin{equation}\label{eq:monotonic}
    S_{\text{vN}}(L_{G'})\geq\frac{d_{G'}-2}{d_{G'}} S_{\text{vN}}(L_G),
\end{equation}
where $d_{G'}=\sum_{v\in V(G')} d(v)$ is the \emph{degree-sum} of $G'$, and $L_G$ and $L_{G'}$ denote the graph Laplacians of $G$ and $G'$, respectively.
\end{theorem}

Theorem~\ref{thm:add_edge} shows that $S_{\text{vN}}(L)$ tends to grow with edge addition. Although Eq.~(\ref{eq:monotonic}) does not guarantee that $S_{\text{vN}}(L)$ increases monotonically, it does suggest that minimizing $S_{\text{vN}}(L)$ may lead to a sparser graph, especially when the degree-sum is large (so that the factor $(d_{G'}-2)/d_{G'}$ is close to $1$).
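
To make Eq.~(\ref{eq:monotonic}) concrete, consider a small example of our own choosing (base-2 logarithms, trace-normalized Laplacians): let $G$ be the path on three nodes and $G'$ the triangle obtained by adding the missing edge. The Laplacian of $G$ has eigenvalues $\{0,1,3\}$ and that of $G'$ has eigenvalues $\{0,3,3\}$, so
\begin{align*}
S_{\text{vN}}(L_G) &= -\tfrac14\log_2\tfrac14-\tfrac34\log_2\tfrac34 \approx 0.811,\\
S_{\text{vN}}(L_{G'}) &= -2\cdot\tfrac12\log_2\tfrac12 = 1.
\end{align*}
With $d_{G'}=6$, the bound reads $1 \geq \tfrac{4}{6}\cdot 0.811 \approx 0.541$, which indeed holds; in this instance the entropy also increases outright.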


\begin{theorem}[\citep{liu2021bridging}]\label{thm:entropy_diff}
For any undirected graph $G=\{V,E\}$, we have:
\begin{equation}\label{eq:vN_bound_weighted}
%\small
    0\leq \Delta H(G) = H(G)-S_{\text{vN}}(L_G) \leq \frac{\log_2e}{\delta} \frac{\tr{W^2}}{d_G},
\end{equation}
where $H(G)=-\sum_{i=1}^{N}\left(\frac{d_i}{d_G}\right)\log_2{\left(\frac{d_i}{d_G}\right)}$, $\delta=\min\{d_i \mid d_i>0\}$ is the minimum positive node degree, $d_G$ is the \emph{degree-sum}, and $W$ is the weighted adjacency matrix of $G$.
\end{theorem}

\begin{theorem}[\citep{liu2021bridging}]\label{thm:convergence}
For almost all unweighted graphs $G$ of order $n$, we have:
\begin{equation}\label{eq:convergence}
%\small
    \frac{H(G)}{S_{\text{vN}}(L_G)} - 1 \geq 0,
\end{equation}
and the left-hand side decays to $0$ at a rate of $\mathcal{O}(1/\log_2(n))$.
\end{theorem}

% In case of an undirected unweighted graph, Eq.~(\ref{eq:vN_bound_weighted}) becomes even tighter:
% \begin{equation}
%     0\leq \Delta H(G) \leq \log_2e.
% \end{equation}



\textcolor{black}{Theorem~\ref{thm:entropy_diff} and Theorem~\ref{thm:convergence} bound the difference between $S_{\text{vN}}(L)$ and $H(G)$, the discrete Shannon entropy of the normalized node degrees. They also indicate that $H(G)$ is a natural fast approximation to $S_{\text{vN}}(L)$. Several other fast approximation approaches exist~\citep{chen2019fast,minello2019neumann,kontopoulou2020randomized}; according to~\citep{liu2021bridging}, $H(G)$ simultaneously enjoys good scalability, interpretability, and provable accuracy.}
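
For a concrete feel for Theorem~\ref{thm:entropy_diff}, take the unweighted star graph on four nodes (an example of our own choosing). Its degrees are $(3,1,1,1)$ with $d_G=6$, and its Laplacian eigenvalues are $\{0,1,1,4\}$, so that
\begin{align*}
H(G) &= \tfrac12\log_2 2 + 3\cdot\tfrac16\log_2 6 \approx 1.792,\\
S_{\text{vN}}(L_G) &= 2\cdot\tfrac16\log_2 6 + \tfrac23\log_2\tfrac32 \approx 1.252 .
\end{align*}
Thus $\Delta H(G)\approx 0.541$, which lies between $0$ and the upper bound $\frac{\log_2 e}{\delta}\frac{\tr{W^2}}{d_G}=\log_2 e\approx 1.443$ (here $\delta=1$ and $\tr{W^2}=6$).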



\subsubsection{$\beta$ controls the sparsity of $G_s$} \label{sec:sparsity}

Unlike spectral sparsifiers~\citep{spielman2011graph,spielman2011spectral}, in which the sparsity of the subgraph is hard to control (i.e., there is no monotonic relationship between the hyperparameter $\epsilon$ and the degree of sparsity as measured by $|E_s|$), we argue that the sparsity of the subgraph obtained by Graph-PRI is mainly determined by the value of the hyperparameter $\beta$. 

%a smaller value of $\beta$ may lead to a more sparse subgraph $G_s$. 

Our argument is mainly based on Theorem~\ref{thm:add_edge}. Here, we additionally claim that, under a mild condition (Assumption~\ref{assumption}), the QJS divergence $D_{\text{QJS}}(L\|L_s)$ tends to decrease with edge addition (Corollary~\ref{corollary}). 

\begin{assumption}\label{assumption}
Given an undirected graph $G=\{V,E\}$, let $G'=G+\{u,v\}$, where $V(G')=V(G)$ and $E(G')=E(G)\cup\{\{u,v\}\}$. Then we have
$S_{\text{vN}}(L_{G'})\geq S_{\text{vN}}(L_G)$,
i.e., the von Neumann entropy $S_{\text{vN}}(L_G)$ is monotonically non-decreasing in the number of edges $|E|$.
\end{assumption}


\begin{corollary}\label{corollary}
Under Assumption~\ref{assumption}, suppose $G_s=\{V_s,E_s\}$ is a sparse graph obtained from $G=\{V,E\}$ by removing edges, and let $G_s'=G_s+\{u,v\}$, where $\{u,v\}$ is an edge of the original graph $G$, $V(G_s')=V(G_s)$ and $E(G_s')=E(G_s)\cup\{\{u,v\}\}$. Then we have
$D_{\text{QJS}}(L_{G_s'}\|L_G)\leq D_{\text{QJS}}(L_{G_s}\|L_G)$,
i.e., adding back an edge of $G$ does not increase the QJS divergence.
\end{corollary}

\textcolor{black}{We provide additional information on the rigor of Assumption~\ref{assumption} in Appendix~A.1.} Combining Theorem~\ref{thm:add_edge} and Corollary~\ref{corollary}, we find that edge addition has opposite effects on $S_{\text{vN}}$ and $D_{\text{QJS}}$: the former is likely to increase, whereas the latter tends to decrease. Therefore, when minimizing the weighted sum of $S_{\text{vN}}$ and $D_{\text{QJS}}$ as in Graph-PRI, one can expect the number of edges in $G_s$ to be mainly determined by the hyperparameter $\beta$: a smaller $\beta$ gives more weight to $S_{\text{vN}}$ and thus encourages a sparser graph. 
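
The trade-off can be traced by hand on a toy case (our own illustration, using base-2 logarithms and writing $2D_{\text{QJS}}$ for the divergence of Eq.~(\ref{eq:jsd}) with the factor $2$ absorbed). Let $G$ be the triangle $K_3$. By symmetry, its non-empty edge subgraphs are characterized by the number of kept edges $|E_s|$, and each subgraph Laplacian commutes with that of $K_3$, so the eigenvalues of $(\tilde{\sigma}+\tilde{\rho})/2$ are pairwise averages of those of $\tilde{\sigma}$ and $\tilde{\rho}$. This gives:
\begin{center}
\begin{tabular}{ccc}
\toprule
$|E_s|$ & $S_{\text{vN}}(\tilde{\sigma})$ & $2D_{\text{QJS}}(\tilde{\sigma}||\tilde{\rho})$ \\
\midrule
$1$ & $0$ & $\approx 0.623$ \\
$2$ & $\approx 0.811$ & $\approx 0.098$ \\
$3$ & $1$ & $0$ \\
\bottomrule
\end{tabular}
\end{center}
The entropy grows and the divergence shrinks as edges are added. Minimizing $S_{\text{vN}}+\beta\cdot 2D_{\text{QJS}}$ over these three candidates selects one edge for $\beta\lesssim 1.55$, two edges for $1.55\lesssim\beta\lesssim 1.93$, and the full triangle for larger $\beta$, so the number of kept edges grows with $\beta$ in this example.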


To empirically justify our argument, we generate a set of graphs with $200$ nodes using either the Erd\H{o}s-R\'enyi (ER) model or the Barab\'asi-Albert (BA) model~\citep{barabasi1999emergence}. For both models, we generate original dense graphs $G$ whose average node degree $\bar{d}$ is approximately $10$, $20$, and $30$, respectively. We then sparsify $G$ to obtain $G_s$ with a random sparsifier that satisfies the spectral property (i.e., Eq.~(\ref{eq:spectral_property})) yet has low computational complexity~\citep{sadhanala2016graph}. 

We finally evaluate the von Neumann entropy of $G_s$ and the QJS divergence between $G$ and $G_s$ for different percentages (pct.) of preserved edges. We repeat the procedure independently $100$ times and plot the averaged results in Fig.~\ref{fig:tradeoff}, from which we can clearly observe the opposite effects mentioned above. We also sparsify the original graph $G$ by our Graph-PRI with different values of $\beta$. The number of preserved edges in $G_s$ with respect to $\beta$ is illustrated in Fig.~\ref{fig:monotonic}, which shows a clear monotonically increasing relationship. 


%we can observe the obvious monotonically increasing relationship between $S_{\text{vN}(G_s)}$ and pct. and the monotonically decreasing relationship between $D_{\text{QJS}}(L\|L_s)$ and pct.


\begin{figure}[ht]
  \begin{subfigure}[]{0.235\textwidth}
    \includegraphics[width=\textwidth]{figures/tradeoff_ER}
    \caption{ER model}
    % \label{}
  \end{subfigure}
  %
  \begin{subfigure}[]{0.235\textwidth}
    \includegraphics[width=\textwidth]{figures/tradeoff_BA}
    \caption{BA model}
  \end{subfigure}
  \caption{The variations of entropy and divergence with respect to different percentages of preserved edges.} 
  \label{fig:tradeoff}
\end{figure}  
  
  
\begin{figure}[ht]
  \begin{subfigure}[]{0.235\textwidth}
    \includegraphics[width=\textwidth]{figures/ER_monotonic}
    \caption{ER model}
    % \label{}
  \end{subfigure}
  %
  \begin{subfigure}[]{0.235\textwidth}
    \includegraphics[width=\textwidth]{figures/BA_monotonic}
    \caption{BA model}
    % \label{}
  \end{subfigure}
  \caption{The monotonically increasing relationship between the number of preserved edges and the hyperparameter $\beta$ in our Graph-PRI.} 
  %\vspace{-0.7em}
  \label{fig:monotonic}
\end{figure}


%\textbf{\textcolor{blue}{[Trade-off effects between two terms]}}


\subsubsection{Graph-PRI in special cases of $\beta$}

Continuing our discussion in Sec.~\ref{sec:sparsity}, it is interesting to infer what happens in some special cases of $\beta$. Here, we restrict our discussion to $\beta=0$ and $\beta \to\infty$. 

When $\beta=0$, due to Theorems~\ref{thm:entropy_diff} and \ref{thm:convergence}, our objective can be interpreted as $\min H(G)$. $H(G)$ takes the mathematical form of the discrete Shannon entropy (i.e., $-\sum_{i=1}^N \mathbb{P}(x_i)\log \mathbb{P}(x_i)$, in which $\mathbb{P}(x_i)$ is the probability of the $i$-th state) on the node degrees. In this sense, $H(G)$ reaches its maximum for uniformly distributed node degrees (i.e., $d_1=\cdots=d_N=k$, which is called a $k$-regular graph) and approaches its minimum if the degree of one node dominates (e.g., a star graph, which possesses a high level of centralization). In fact, it was also conjectured in~\citep{dairyko2017note} that among connected graphs of fixed order $n$, the star graph $S_n$ minimizes the von Neumann entropy. Thus, $S_{\text{vN}}(L)$ can also be interpreted as a measure of degree heterogeneity or graph centrality~\citep{simmons2018neumann}, and minimizing $S_{\text{vN}}(L)$ pushes Graph-PRI to learn a more centralized graph.
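
The two extremes of $H(G)$ can be written in closed form (a short calculation of our own, using base-2 logarithms). For a $k$-regular graph on $N$ nodes, every normalized degree equals $1/N$; for the star graph, the hub has normalized degree $1/2$ and each of the $N-1$ leaves has normalized degree $1/(2(N-1))$. Hence
\begin{align*}
H(G_{k\text{-regular}}) &= \log_2 N,\\
H(G_{\text{star}}) &= 1+\tfrac12\log_2(N-1).
\end{align*}
For $N=34$ (the number of nodes of the karate club network), these evaluate to roughly $5.09$ and $3.52$ bits, respectively.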

When $\beta\to\infty$, we expect to recover the original graph $G$ by Corollary~\ref{corollary}. Fig.~\ref{fig:pri_demo} corroborates our analysis.

Interestingly, similar properties also hold for the original PRI in the scalar random variable setting (see Appendix~B). 

%But we build a bridge between graph data and scalar random variable by PRI.

%Before closing Sec.~\ref{sec:justification}, we would like to emphasize here that although the von Neumann entropy functional has recently found applications in downstream tasks of complex networks and pattern recognition, the majority of the literature simply use it as a statistical measure that do not integrate it in a learning objective that involves numerical optimization such as gradient descent.


\begin{figure*}[ht]
  \begin{subfigure}[]{0.23\textwidth}
    \includegraphics[width=\textwidth]{karate_original}
    \caption{Karate (original)}
    % \label{}
  \end{subfigure}
  %
  \begin{subfigure}[]{0.23\textwidth}
    \includegraphics[width=\textwidth]{karate_beta0}
    \caption{$\beta=0$}
    % \label{}
  \end{subfigure}
  %
%     \begin{subfigure}[]{0.3\textwidth}
%     \includegraphics[width=\textwidth]{karate_beta1}
%     \caption{$\beta=1$}
%     % \label{}
%   \end{subfigure} 
%   %
%     \begin{subfigure}[]{0.3\textwidth}
%     \includegraphics[width=\textwidth]{karate_beta2}
%     \caption{$\beta=2$}
%     % \label{}
%   \end{subfigure}
  %
    \begin{subfigure}[]{0.23\textwidth}
    \includegraphics[width=\textwidth]{karate_beta5}
    \caption{$\beta=5$}
    % \label{}
  \end{subfigure}
  %
  \begin{subfigure}[]{0.23\textwidth}
    \includegraphics[width=\textwidth]{karate_beta1000}
    \caption{$\beta=1000$}
    % \label{}
  \end{subfigure}
  \caption{Illustration of the sparsified graph structures revealed by our Graph-PRI for (a) Zachary's karate club~\citep{zachary1977information}. As the value of $\beta$ increases, the solution passes from (b) an approximate star graph to the extreme case of (d) $\beta\rightarrow\infty$, in which we recover the original graph as the solution.}
  %\vspace{-0.7em}
  \label{fig:pri_demo}
\end{figure*}


%To make our graph topology close to real world scenario, we generate a dense scale-free network with 50 nodes via Barab\'{a}si-Albert model. Then, the graph is sparsified using k-neighbors sparsifier in \cite{sadhanala2016graph} with different $k$ values. Fig.~\ref{fig:graph_topology} shows the topology of original graph and its sparsified versions (value of $k$ is shown in the parenthesis). Fig.~\ref{fig:information_values} shows the values of von Neumann entropy (computed with Eq.~(\ref{eq:VN_entropy})) and Jessen-Shannon divergence (computed with Eq.~(\ref{eq:jsd})) with respect to different values of $k$.




\subsection{Optimization}
We define a gradient descent algorithm to solve Eq.~(\ref{eq:pri_final}). As discussed in Section~\ref{sec:problem}, we have $\rho=BB^T$ and $\sigma_\mathbf{w}=B\diag{(\mathbf{w})}B^T$, in which $\mathbf{w}$ is the edge selection vector. For simplicity, we assume that the selections of edges from the original graph $G$ are conditionally independent of each other~\citep{luo2020parameterized}, that is, $\mathbb{P}_\mathbf{w}=\prod_{i=1}^{M} \mathbb{P}_{w_i}$. Due to the discrete nature of $G_s$, we relax $\mathbf{w}=[w_1,w_2,\ldots,w_M]$ from a binary vector in $\{0,1\}^M$ to a continuous real-valued vector in $[0,1]^M$. In this sense, the value of $w_i$ can be interpreted as the probability of selecting the $i$-th edge.

% Thus, our objective reduces to:
% \begin{eqnarray} \label{eq:cost}
% J_\beta(\rho, \sigma_\mathbf{w}) = S(\tilde{\sigma}_{\mathbf{w}}) + \beta D(\tilde{\sigma}_{\mathbf{w}}||\tilde{\rho})
% \end{eqnarray}

%To compute the gradient of Eq.~(\ref{eq:cost}), we use $L$, the Laplacian matrix of the original graph. We use the incident matrix definition of the Laplacian matrix $L = E\diag{v}E^T$, where $v$ is the vector of the edge values of the original graph. We also describe the target graph in term of the Laplacian matrix $L_w$. We assume that the new graph is a sub set of the original graph, in this way we can express $L_w=E\diag{w}E^T$. As density matrix of the graph we then we use the Laplacian matrices, thus $\sigma_w = L_w$ and $\rho = L$. 

% The RIGP algorithm iteratively computes the gradient given in Theorem~\ref{th:grad} and identifies the edges that contribute most on the reduction of the quantity defined in Eq.~(\ref{eq:cost}). PRGP sets these edges to $1$, when starting from the empty graph. The algorithm iterates till stopping criteria is met (for example no more change on the edges). The algorithm is presented in Algorithm~\ref{alg:RIGP}. 

%{\color{blue} In Fig.~\ref{fig:RIGP1}, the algorithm starts from left and stops when the optimal point is reached. An alternative algorithm is $w(i) = \frac1{2} (1 - \text{sign} ( g_w(i)) )$. The problem defined by Eq.\ref{eq:pri} is combinatorial whose complexity is $2^N$. Alg.\ref{alg:RIGP} is not guaranteed to find the minimum, but is linear in $N$.}




%which follows the Bernoulli distribution.

% $\text{Bernoulli}(w_i)$

% Instead of sampling a hard one-hot vector

In practice, we use the Gumbel-softmax~\citep{maddison2017concrete,jang2016categorical} to update $w_i$. In particular, suppose we want to approximate a categorical random variable represented as a one-hot vector in $\mathbb{R}^K$ with category probabilities $p_1,p_2,\cdots,p_K$ (here, $K=2$). The Gumbel-softmax then gives a $K$-dimensional sampled vector whose $i$-th entry is:
\begin{equation}
    \hat{p}_i = \frac{\exp\left((\log p_i + g_i)/\tau\right)}{\sum_{j=1}^K \exp\left((\log p_j + g_j)/\tau\right)},
\end{equation}
where $\tau$ is a temperature for the Concrete distribution and $g_i$ is generated from a $\text{Gumbel}(0,1)$ distribution:
\begin{equation}
    g_i = -\log (-\log u_i), \quad u_i\sim \text{Uniform}(0,1).
\end{equation}
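
For the binary case used here ($K=2$), the two expressions above collapse to a single noisy sigmoid; we spell this out as a short derivation under the assumption $p_2=1-p_1$. Dividing the numerator and denominator of $\hat{p}_1$ by the numerator gives
\begin{equation*}
\hat{p}_1 = \left[1+\exp\left(-\frac{\log\frac{p_1}{1-p_1} + g_1-g_2}{\tau}\right)\right]^{-1},
\end{equation*}
and, since the difference of two independent $\text{Gumbel}(0,1)$ variables follows a standard logistic distribution, $g_1-g_2$ can equivalently be sampled as $\log u - \log(1-u)$ with $u\sim \text{Uniform}(0,1)$. As $\tau\to 0$, $\hat{p}_1$ concentrates on $\{0,1\}$, whereas a larger $\tau$ yields smoother samples.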

% $w_i$ can be parameterized as a $2$-dimensional vector $(p_0, p_1)$ with $\theta_i$ is the probability that $\theta_i = \mathbb{P}(w_i=r), r=0,1$.

% with gradient-based methods $w \sim \text{Bernoulli}_{\theta}(w)$:

% \begin{equation}
% \text{Bernoulli}(w)=\sigma\left(\frac{1}{\lambda}(\log{w}-\log{(1-w)}+\log{u}-\log{(1-u)})\right),
% \end{equation}
% \begin{align}
% \text{Bernoulli}_\theta(w) =\sigma\left(\frac{1}{\lambda}(\log{\theta}-\log{(1-\theta)}+\log{u}-\log{(1-u)})\right),
% \end{align}
% $$
% \theta^{k+1} = \theta^k - \eta \nabla_\theta \E_{w \sim \text{Bernoulli}_\theta(w) } D_\beta(\rho, \sigma_w)
% $$
%where $\sigma(\cdot)$ is the Sigmoid activation function, $u\sim \text{Uniform}(0,1)$ and $\lambda$ is a temperature for the Concrete distribution. Intuitively, $\mathbf{w}$ is trained to assign a high probability to edges that are informative or significant for preserving the graph structure.



Note that, although we use the Gumbel-softmax to ease the optimization, Graph-PRI itself has an analytical gradient (Theorem~\ref{th:grad}).
The detailed algorithm of Graph-PRI is given in Appendix~E, together with a PyTorch example.

% \footnote{$v = \mvec{A}$, $A = [v_{ij}]$ the adjacent matrix}

{\color{black}
\begin{theorem} \label{th:grad}
The gradient of Eq.~(\ref{eq:pri_final}) with respect to edge selection vector $\mathbf{w}$ is:
% \begin{eqnarray}
% \nabla_w D_\beta(\rho, \sigma_w) = -\diag{\left( M^T \left[ \left(1-\frac{\beta}{2}\right)\ln \sigma_w +\frac{\beta}{2}\ln \bar{\sigma}_w\right]M \right)}
% \end{eqnarray}
% \D_\beta(\rho, \sigma_w) 
% \begin{align}
% \nabla_w 
% \mathcal{J}_{\text{Graph-PRI}} = -\diag{\left( M^T \left[ \left(1-\beta\right)\ln \sigma_w +\beta\ln \bar{\sigma}_w\right]M \right)}
% \end{align}
\begin{align}
\nabla_\mathbf{w} 
\mathcal{J}_{\text{Graph-PRI}} = U g,
\end{align}
where $\tilde{\mathbf{w}} = \mathbf{w} / \sum_{i=1}^M w_i$ is the normalized $\mathbf{w}$, $\tilde{\mathbf{1}}_{M}  = \frac1{M} \mathbf{1}_{M}$ is the normalized all-ones vector, $\bar{\sigma}_\mathbf{w} = \frac{1}{2} \left( \tilde{\sigma}_\mathbf{w} +\tilde{\rho} \right) = \frac{1}{2} B  \diag{\left(\tilde{\mathbf{w}} + \tilde{\mathbf{1}}_{M} \right)} B^T$, $g = -\diag{\left( B^T \left[ \left(1-\beta\right)\ln \tilde{\sigma}_\mathbf{w} +\beta\ln \bar{{\sigma}}_\mathbf{w}\right]B \right)}$, and $U = \{ u_{ij} \} \in \mathbb{R}^{M\times M}$ with $u_{ij} = -\frac{\tilde{w}_j}{1-\tilde{w}_i}$ for all $i \ne j$ and $u_{ii} = 1$.
\end{theorem}
}
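
The form of $g$ can be motivated by a brief matrix-calculus sketch (ours, not a substitute for the proof; it treats $\ln$ on the support of the matrices and takes $\tilde{\sigma}_\mathbf{w}=B\diag{(\tilde{\mathbf{w}})}B^T$ as in the theorem). For $S_{\text{vN}}(\tilde{\sigma})=-\tr{\tilde{\sigma}\ln\tilde{\sigma}}$ and a perturbation $d\tilde{\sigma}$,
\begin{equation*}
dS_{\text{vN}}(\tilde{\sigma}) = -\tr{(\ln\tilde{\sigma}+I)\,d\tilde{\sigma}} .
\end{equation*}
Applying this to both terms in the last line of Eq.~(\ref{eq:pri_final}), with $d\bar{\sigma}_\mathbf{w}=\frac12 d\tilde{\sigma}_\mathbf{w}$ and $\partial\tilde{\sigma}_\mathbf{w}/\partial\tilde{w}_i=\mathbf{b}_i\mathbf{b}_i^T$ (where $\mathbf{b}_i$ denotes the $i$-th column of $B$), gives
\begin{equation*}
\frac{\partial \mathcal{J}}{\partial \tilde{w}_i} = -\mathbf{b}_i^T\left[(1-\beta)\ln\tilde{\sigma}_\mathbf{w}+\beta\ln\bar{\sigma}_\mathbf{w}\right]\mathbf{b}_i - \|\mathbf{b}_i\|_2^2 .
\end{equation*}
The first term is the $i$-th entry of $g$. The second term is the same for every edge when all columns of $B$ have equal norm (as we assume for an incidence matrix), and it is removed by $U$, whose rows sum to zero: $\sum_j u_{ij} = 1-\sum_{j\neq i}\tilde{w}_j/(1-\tilde{w}_i)=0$ because $\sum_j\tilde{w}_j=1$. In this reading, $U$ accounts for the dependence of $\tilde{\mathbf{w}}$ on $\mathbf{w}$ through the normalization.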


% %While $\tilde{\mathbf{w}} = \mathbf{w} / \sum_{i=1}^M w_i$ and $\tilde{\mathbf{1}}_{M}  = \frac1{M} \mathbf{1}_{M}$.
% where $\bar{\sigma}_w = \frac{1}{2} \left( \tilde{\sigma}_w +\tilde{\rho} \right) = \frac{1}{2} M  \diag{\left(\tilde{\mathbf{w}} + \tilde{\mathbf{1}}_{M} \right)} M^T$ and $g = -\diag{\left( M^T \left[ \left(1-\beta\right)\ln \tilde{\sigma}_w +\beta\ln \bar{{\sigma}}_w\right]M \right)}$ and $U = \{ u_{ij} \}, u_{ij} = -\frac1{M-1},  \forall ij | i \ne j, u_{ii} = 1$. 
% \commentR{punctuation}

%\vspace{-0.7em}
% \begin{algorithm}[htb]
% \caption{PRI for Graph Sparsification}
% \label{alg:PRI_Graph}
% \begin{algorithmic}[1]
% \Require
% $\rho = MM^T$, $\beta$, learning rate $\eta$, mini-batch size $B$
% \Ensure
% $\sigma_{\mathbf{w}}$
% \State $M\gets$ incident matrix of $\rho$;
% \State Initialize $\mathbf{\theta}=\{\theta_1,\theta_2,\cdots,\theta_M\}$;
% \While {not converged}
% \For{$i=1,2,\cdots,B$}
%   \State $\mathbf{w}^i\gets$ Gumbel-softmax sampling from $p_{\mathbf{\theta}}=\Sigmoid(\mathbf{\theta})$;
%   \State $\sigma_{\mathbf{w}^i} = M \diag{(\mathbf{w}^i)} M^T$;
% \EndFor
% %\State $w(i^*)=v(i^*)$ \Comment{or $1$ if we want to have graph of unit edge value}\;
% \State $\mathbf{\theta} \gets \mathbf{\theta} - \eta \frac{1}{B} \sum_{i=1}^B \nabla_{\mathbf{\theta}} D_\beta(\rho, \sigma_{\mathbf{w}^i})$;
% \EndWhile
% \State $\mathbf{w}\gets$ Gumbel-softmax sampling from $p_{\mathbf{\theta}}=\Sigmoid(\mathbf{\theta})$; \\
% \Return $\sigma_{\mathbf{w}} = M \diag{(\mathbf{w})} M^T$;
% \end{algorithmic}
% \end{algorithm}
%\vspace{-0.7em}


\subsection{Approximation and Connectivity Constraint}\label{sec:approximation}
\textcolor{black}{The computation of the von Neumann entropy requires the eigenvalue decomposition of a trace-normalized SPS matrix, which usually takes $\mathcal{O}(N^3)$ time. In practical applications in which computational time is a major concern (e.g., when training deep neural networks or when dealing with large graphs with hundreds of thousands of nodes), based on Theorems~\ref{thm:entropy_diff} and \ref{thm:convergence}, we simply approximate $S_{\text{vN}}(L_G)$ with the discrete Shannon entropy of the normalized node degrees, $H(G)$, which immediately reduces the computational complexity to $\mathcal{O}(N)$. Unless otherwise specified, the experiments in the next section still use the original $S_{\text{vN}}(L_G)$.}
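
The relation between the two quantities can be stated compactly (a remark of ours based on standard facts). The diagonal of a graph Laplacian holds the node degrees, so $H(G)$ is the Shannon entropy of the diagonal of the trace-normalized Laplacian $\tilde{L}_G = L_G/\tr{L_G}$, whereas $S_{\text{vN}}(L_G)$ is the Shannon entropy of its eigenvalues $\{\lambda_i\}$:
\begin{align*}
H(G) &= -\sum_{i=1}^N [\tilde{L}_G]_{ii}\log_2 [\tilde{L}_G]_{ii}, \\
S_{\text{vN}}(L_G) &= -\sum_{i=1}^N \lambda_i\log_2\lambda_i .
\end{align*}
Reading off the diagonal requires no eigendecomposition. Moreover, by the Schur--Horn theorem the diagonal of a symmetric matrix is majorized by its eigenvalues, and the Shannon entropy is Schur-concave, which gives $H(G)\geq S_{\text{vN}}(L_G)$, in line with the lower bound in Theorem~\ref{thm:entropy_diff}.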

% Recent efficient approximations include [?] and [?]. 

%\textbf{\textcolor{blue}{[Bound the value of cross-term $S_{\text{vN}} (\frac{\rho+\sigma}{2})$]}}

On the other hand, when the connectivity of the subgraph is preferred, one can simply add a regularization term on the node degrees~\citep{kalofolias2016learn}:
\begin{eqnarray} \label{eq:cost_connectivity}
\min_{\mathbf{w}} S_{\text{vN}}(\tilde{\sigma}_{\mathbf{w}}) + \beta D_{\text{QJS}}(\tilde{\sigma}_{\mathbf{w}}||\tilde{\rho}) - \alpha \mathbf{1}^T \log{(\diag{(\sigma_{\mathbf{w}})})},
\end{eqnarray}
where the hyperparameter $\alpha>0$. This logarithmic barrier forces the node degrees to be positive and improves the connectivity of the graph without compromising sparsity. Unless otherwise specified, we select $\alpha=0.005$ throughout this work. 
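
To see why the extra term acts as a barrier (a short remark of ours, assuming $B$ is the incidence matrix with entries in $\{-1,0,1\}$): the $i$-th diagonal entry of $\sigma_\mathbf{w}=B\diag{(\mathbf{w})}B^T$ is the relaxed degree of node $i$,
\begin{equation*}
[\sigma_\mathbf{w}]_{ii} = \sum_{e=1}^{M} B_{ie}^2\, w_e = \sum_{e\ni i} w_e =: d_i(\mathbf{w}),
\end{equation*}
so the penalty equals $-\alpha\sum_{i=1}^N\log d_i(\mathbf{w})$ and diverges to $+\infty$ whenever the total weight of the edges incident to some node approaches zero. It therefore discourages isolated nodes, although positive degrees alone do not guarantee that $G_s$ is connected.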

%We select $\alpha=0.01$ in this paper.



\section{Experimental Evaluation}\label{sec:experiments}

%We present three representative machine learning applications to demonstrate the effectiveness and versatility of our Graph-PRI. 

In this section, we demonstrate the effectiveness and versatility of our Graph-PRI in multiple graph-related machine learning tasks. Our experimental study is guided by the following three questions:
%\begin{itemize}[leftmargin=*]
\begin{itemize}
 \item[\textbf{Q1}] What kind of structural property or information does our method preserve? 
 \item[\textbf{Q2}] How well does our method compare against popular and competitive graph sparsification baselines?
 \item[\textbf{Q3}] How can Graph-PRI be used in practical machine learning problems, and what are the performance gains?
\end{itemize}

% \footnote{For weighted graph, the sampling rate is proportional to edge weight.}

The selected competing methods include one baseline and three state-of-the-art (SOTA) methods: 1) Random Sampling (RS), which randomly prunes a percentage of edges; 2) Local Degree (LD)~\citep{hamann2016structure}, which only preserves the top $|\text{degree}(v)^\alpha|$ ($0\leq\alpha\leq 1$) neighbors (sorted by degree in descending order) for each node $v$; 3) Local Similarity (LS)~\citep{satuluri2011local}, which applies the Jaccard similarity function to the neighborhoods of nodes $v$ and $u$ to score the edge $(u,v)$; and 4) Effective Resistance (ER)~\citep{spielman2011graph}. We implement RS, LD, and LS with NetworKit\footnote{\url{https://networkit.github.io/}}, and ER with PyGSP\footnote{\url{https://github.com/epfl-lts2/pygsp}}.

%neighborhoods $N(v)$ and $N(u)$. 

%For node $v$, its connected edges are ranked and preserved according to JS score.  

  

% For SparRL, we use authors' original code.

%We compare Graph-PRI with one baseline method ($i.e.$, the random sampling or RS that randomly prunes a percentage of edges), three popular graph sparsification techniques ($i.e.$, the Effective Resistance or ER~\cite{spielman2011graph}, the Local Degree~\cite{hamann2016structure}, and the L-Spar~\cite{satuluri2011local}), and an additional recently proposed state-of-the-art learning-based method ($i.e.$, the SparRL~\cite{wickman2021sparrl} that uses reinforcement learning to sequentially remove less informative edges).

%Code is available at \url{https://github.com/SJYuCNEL/Bregman-Correntropy}.





\subsection{Graph Sparsification}

We use two synthetic datasets and four real-world networks from the KONECT network datasets\footnote{\url{http://konect.cc/networks/}} for evaluation. 
They are: 
\textbf{G1}: a $k$-NN ($k=10$) graph with $20$ nodes that constitute a global circle structure; 
\textbf{G2}: a stochastic block model (SBM) with four distinct communities ($30$ nodes in each community, and intra- and inter-community connection probabilities of $2^{-2}$ and $2^{-7}$, respectively);
\textbf{G3}: the widely used Zachary's karate club network ($34$ nodes and $78$ edges); \textbf{G4}: a network of contacts between suspected terrorists involved in the Madrid train bombing of March $11$, $2004$ ($64$ nodes and $243$ edges); \textbf{G5}: a network of books about US politics published around the time of the $2004$ presidential election and sold by the online bookseller Amazon.com ($105$ nodes and $441$ edges); and \textbf{G6}: a collaboration network of jazz musicians ($198$ nodes and $2{,}742$ edges).

%We argue that PRI has advantages in preserving two essential properties associated with the original graph: 1) the spectral distance; and 2) the graph centrality. 

%Graph-PRI minimizes a weighted combination of von Neumann entropy and QJS divergence. The first term can be interpreted as the Shannon entropy on degree of node and thus promotes graph centrality. The second term is a distance measure. Therefore, 

We expect Graph-PRI to preserve two essential properties of the original graph: 1) the spectral similarity (due to the divergence term); and 2) the graph centrality (due to the entropy term). We empirically justify our claims with two metrics: the geodesic distance $d_{\vec{x}}(\rho,\sigma)$~\citep{bravo2019unifying}:
\begin{equation}
    d_{\vec{x}}(\rho,\sigma) = \mathrm{arccosh}\left(1+\frac{\|(\rho-\sigma)\vec{x}\|_2^2\|\vec{x}\|_2^2}{2(\vec{x}^T\rho\vec{x})(\vec{x}^T\sigma \vec{x})}\right),
\end{equation}
in which we select $\vec{x}$ to be the eigenvector associated with the smallest non-trivial eigenvalue of the original Laplacian $\rho$, as it encodes the global structure of a graph; and the graph centralization measure $C_D$~\citep{freeman1978centrality}:
\begin{equation}
    C_D = \frac{\sum_{i=1}^N \left[\max(d_j) - d_i\right]}{N^2-3N+2},
\end{equation}
in which $\max(d_j)$ refers to the maximum node degree.
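
Two limiting cases help calibrate $C_D$ (a quick check of our own). For a $k$-regular graph, every term $\max(d_j)-d_i$ vanishes, so $C_D=0$. For a star graph on $N$ nodes, the hub has degree $N-1$ and each of the $N-1$ leaves has degree $1$, so
\begin{equation*}
C_D = \frac{(N-1)\left[(N-1)-1\right]}{N^2-3N+2} = \frac{(N-1)(N-2)}{(N-1)(N-2)} = 1 .
\end{equation*}
Values of $C_D$ close to $1$ thus indicate a star-like, highly centralized graph.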

Fig.~\ref{fig:spectral_distance} and Fig.~\ref{fig:centrality} show, respectively, the values of $d_{\vec{x}}(\rho,\sigma)$ and $C_D$ with respect to different edge-preserving ratios (i.e., $|E_s|/|E|$) for different sparsification methods. As can be seen, our Graph-PRI always achieves the second-best performance across different graphs. Although LD has advantages in preserving spectral distance and graph centrality, it does not perform compellingly in practical applications, as will be demonstrated in the next subsection.


%For $d_{\vec{x}}(\rho,\sigma)$, .

%Karate; Jazz musicians; $2004$ Madrid train bombings; Political books; 



%Following~\citep{spielman2011spectral,bravo2019unifying}, three widely used network data are selected for evaluation. They are, \textbf{D1}: a collaboration network of Jazz musicians ($198$ nodes and $2,742$ edges) from~\cite{gleiser2003community}; \textbf{D2}: the \emph{C. elegans} posterior nervous system connectome ($269$ nodes and $2,902$ edges) from~\citep{jarrell2012connectome}; and \textbf{D3}: a weighted social network of face-to-face interactions between primary school students, with initial edge weights proportional to the number of interactions between pairs of students ($236$ nodes and $5,899$ edges) from~\citep{stehle2011high}.


%\begin{comment}
% \begin{figure}[ht]
% \centering
% \includegraphics[width=0.45\textwidth]{centrality_karate.pdf}
% \caption{Quantitative Evaluation on graph centrality ($x$-axis is the percentage of preserved edges).}
% %\vspace{-0.7em}
% \label{fig:information_values}
% \end{figure}

\begin{figure*}[ht]
  \begin{subfigure}[]{0.16\textwidth}
    \includegraphics[width=\textwidth]{sec1_figures/syn_distance.pdf}
    \caption{$k$-NN}
    % \label{}
  \end{subfigure}
  %
  \begin{subfigure}[]{0.16\textwidth}
    \includegraphics[width=\textwidth]{sec1_figures/community_distance.pdf}
    \caption{SBM}
    % \label{}
  \end{subfigure}
  \begin{subfigure}[]{0.16\textwidth}
    \includegraphics[width=\textwidth]{sec1_figures/karate_distance.pdf}
    \caption{Karate club}
    % \label{}
  \end{subfigure}
  %
  \begin{subfigure}[]{0.16\textwidth}
    \includegraphics[width=\textwidth]{sec1_figures/spain_distance.pdf}
    \caption{Train bombing}
    % \label{}
  \end{subfigure}
    \begin{subfigure}[]{0.16\textwidth}
    \includegraphics[width=\textwidth]{sec1_figures/pol_distance.pdf}
    \caption{US political books}
    % \label{}
  \end{subfigure}
  %
  \begin{subfigure}[]{0.16\textwidth}
    \includegraphics[width=\textwidth]{sec1_figures/jazz_distance.pdf}
    \caption{Jazz musicians}
    % \label{}
  \end{subfigure}
  \caption{Spectral distance $d_{\vec{x}}(\rho,\sigma)$ (the smaller the better).}
  %\vspace{-0.7em}
  \label{fig:spectral_distance}
\end{figure*}



\begin{figure*}[ht]
  \begin{subfigure}[]{0.16\textwidth}
    \includegraphics[width=\textwidth]{sec1_figures/syn_centrality.pdf}
    \caption{$k$-NN}
    % \label{}
  \end{subfigure}
  %
  \begin{subfigure}[]{0.16\textwidth}
    \includegraphics[width=\textwidth]{sec1_figures/community_centrality.pdf}
    \caption{SBM}
    % \label{}
  \end{subfigure}
  \begin{subfigure}[]{0.16\textwidth}
    \includegraphics[width=\textwidth]{sec1_figures/karate_centrality.pdf}
    \caption{Karate club}
    % \label{}
  \end{subfigure}
  %
  \begin{subfigure}[]{0.16\textwidth}
    \includegraphics[width=\textwidth]{sec1_figures/spain_centrality.pdf}
    \caption{Train bombing}
    % \label{}
  \end{subfigure}
    \begin{subfigure}[]{0.16\textwidth}
    \includegraphics[width=\textwidth]{sec1_figures/pol_centrality.pdf}
    \caption{US political books}
    % \label{}
  \end{subfigure}
  %
  \begin{subfigure}[]{0.16\textwidth}
    \includegraphics[width=\textwidth]{sec1_figures/jazz_centrality.pdf}
    \caption{Jazz musicians}
    % \label{}
  \end{subfigure}
  \caption{Graph centrality $C_D$ (the larger the value, the higher the degree of centrality).}
  %\vspace{-0.7em}
  \label{fig:centrality}
\end{figure*}



\begin{comment}
\subsubsection{Airport Dataset} The \texttt{Airport} dataset\footnote{\url{http://konect.uni-koblenz.de/networks/},???}
consists of the directed network of flights between US airports in $2010$. Each edge is a connection between two airports, while the edge is the number of flights. It is composed by $1'227$ edges and  $2'940$ nodes. The sparse graph consists of $1'692$ edges, where we use $\beta=100$.
Fig.\ref{fig:Airports} shows the application of the relevant information principle to the \texttt{Airport} dataset. We also show the Markov Clustering \cite{van2000graph} applied to the original and sparse graphs.
\end{comment}


\begin{comment}
\begin{figure*}%[t!]
	\centering
	\begin{subfigure}{.5\textwidth}
		\centering
		\includegraphics[width=1.\linewidth,trim=0 0cm 0 1cm, clip]{airports_full}
		\caption{Airports (full)}
		\label{fig:Airports_full}
	\end{subfigure}%
	\begin{subfigure}{.5\textwidth}
		\centering
		\includegraphics[width=1.\linewidth,trim=0 0cm 0 1cm, clip]{airports_sparse}
		\caption{Airports (sparse)}
		\label{fig:Airports_sparse}
	\end{subfigure}	
	\\
	\begin{subfigure}{.5\textwidth}
		\centering
		\includegraphics[width=1.\linewidth,trim=0 0cm 0 1cm, clip]{airports_clustered}
		\caption{Airports (full) with clustering}
		\label{fig:Airports_full}
	\end{subfigure}%
	\begin{subfigure}{.5\textwidth}
		\centering
		\includegraphics[width=1.\linewidth,trim=0 0cm 0 1cm, clip]{airports_sparse_clustered}
		\caption{Airports (sparse) with clustering}
		\label{fig:Airports_sparse}
	\end{subfigure}	
	\caption{Visualization of the largest connected component of the \texttt{Airport} dataset, before and after the graph reduction.}
	\label{fig:Airports}
\end{figure*}
\end{comment}



\subsection{Graph-Regularized Multi-task Learning}\label{sec:MTL}
In traditional multi-task learning (MTL), we are given a group of $T$ related tasks. In each task, we have access to a training set $\mathcal{D}_t=\{(\mathbf{x}_t^i,y_t^i)\}_{i=1}^{N_t}$, $t=1,\cdots,T$, with $N_t$ data instances. In this section, we focus on the regression setup in which $\mathbf{x}_t^i\in\mathcal{X}_t\subseteq\mathbb{R}^d$ and $y_t^i\in\mathbb{R}$. Multi-task learning aims to learn from each training set $\mathcal{D}_t$ a prediction model $f_t(\mathbf{w}_t,\cdot):\mathcal{X}_t\rightarrow\mathbb{R}$ with parameter $\mathbf{w}_t$ such that the task relatedness is taken into consideration and the overall generalization error is small.

% ~\citep{zhang2021survey}

In what follows, we assume a linear model in each task, i.e., $f_t(\mathbf{w}_t,\mathbf{x})=\mathbf{w}_t^T\mathbf{x}$.
The multi-task regression problem with a regularization $\Omega$ on the model parameters $W=[\mathbf{w}_1,\mathbf{w}_2,\cdots,\mathbf{w}_T]$ can thus be defined as:
\begin{equation}\label{eq:obj1}
\min_{W} \sum_{t=1}^T\|\mathbf{w}_t^T\mathbf{x}_t-y_t\|_2^2 + \gamma\Omega(W).
\end{equation}


% W\in \mathbf{M}_{d,T}

%where $W=[\mathbf{w}_1,\mathbf{w}_2,\cdots,\mathbf{w}_T]$ consists of columns $\mathbf{w}_t$.

% These tasks may be viewed as drawn from an unknown joint distribution of tasks, which is the source of the bias that relates the tasks. 

%we are given a group of $T$ tasks, each one is characterized by its input $\mathbf{x}$ and output $y$. 


%We aim to learn the regression coefficient for $T$ tasks (denote them $\mathbf{w}_1,\mathbf{w}_2,\cdots,\mathbf{w}_T$) simultaneously by incorporating the relatedness amongst these tasks. 

% A larger edge weights indicates a stronger relationship. 

A graph is a natural way to encode the relationships among multiple tasks: each node represents a single task, and an edge connects two tasks if they are strongly correlated. In this sense, the objective for multi-task regression regularized with a graph adjacency matrix $A$ can be formulated as~\citep{he2019efficient}:
\begin{equation}\label{eq:obj_MTL}
\min_{W} \sum_{t=1}^T\|\mathbf{w}_t^T\mathbf{x}_t-y_t\|_2^2 + \gamma\sum_{i=1}^T\sum_{j\in\mathcal{N}_i}A_{ij}\|\mathbf{w}_i-\mathbf{w}_j\|_2^2,
\end{equation}
where $\mathcal{N}_i$ is the set of neighbors of the $i$-th task.
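
The regularizer in Eq.~(\ref{eq:obj_MTL}) can be related to the graph Laplacian. As a sketch, assume that $A$ is symmetric, that $A_{ij}=0$ whenever $j\notin\mathcal{N}_i$, and let $D$ be the diagonal degree matrix with $D_{ii}=\sum_j A_{ij}$ and $L=D-A$ (these assumptions and the symbol $D$ are introduced here for illustration only). Expanding the squared norm gives
\begin{align*}
\sum_{i=1}^T\sum_{j\in\mathcal{N}_i}A_{ij}\|\mathbf{w}_i-\mathbf{w}_j\|_2^2
&= \sum_{i,j}A_{ij}\left(\|\mathbf{w}_i\|_2^2+\|\mathbf{w}_j\|_2^2-2\mathbf{w}_i^T\mathbf{w}_j\right)\\
&= 2\sum_{i}D_{ii}\|\mathbf{w}_i\|_2^2-2\sum_{i,j}A_{ij}\mathbf{w}_i^T\mathbf{w}_j
= 2\,\mathrm{tr}\left(WLW^T\right),
\end{align*}
so that, under these assumptions, pruning an edge $(i,j)$ simply removes the coupling term between $\mathbf{w}_i$ and $\mathbf{w}_j$ from the penalty.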

%$W=[\mathbf{w}_1,\mathbf{w}_2,\cdots,\mathbf{w}_T]$ consists of columns $\mathbf{w}_t$.

Usually, a dense graph $G$ is estimated at first to fully characterize task relatedness~\citep{chen2010graph,he2019efficient}. Here, we are interested in: 1) sparsifying $G$ to remove redundant or less-important connections (edges) between tasks; and 2) validating whether the sparsified graph can further reduce the generalization error.

%We optimize objective~(\ref{eq:obj_MTL}) with the Combinatorial Multigrid (CMG) solver~\cite{koutis2011combinatorial} and test it on school data set~\cite{bakker2003task}. 

To this end, we exemplify our motivation with the recently proposed Convex Clustering Multi-Task regression Learning (CCMTL)~\citep{he2019efficient}, which optimizes Eq.~(\ref{eq:obj_MTL}) with the Combinatorial Multigrid (CMG) solver~\citep{koutis2011combinatorial}, and test its performance on two benchmark MTL datasets\footnote{See Appendix~C for details of the datasets in Sections~\ref{sec:MTL} and \ref{sec:brain}.}: 1) a synthetic dataset~\citep{gonccalves2016multi} with $20$ tasks, in which tasks $1$--$10$ are mutually related and tasks $11$--$20$ are mutually related; and 2) a real-world Parkinson's disease dataset\footnote{\url{https://archive.ics.uci.edu/ml/datasets/parkinsons+telemonitoring}}, which contains biomedical voice measurements from $42$ patients. We view each patient as a single task and aim to predict the motor Unified Parkinson's Disease Rating Scale (UPDRS) score based on $19$-dimensional features such as age, gender, and jitter and shimmer voice measurements. \textcolor{black}{In both datasets, the initial dense task-relatedness graph $G$ is estimated as follows: we perform linear regression on each task individually; the relatedness between two tasks is modeled as the $\ell_2$ distance between their independently learned linear regression coefficients; we then construct a $k$-nearest neighbor ($k=10$) graph $G$ from all pairwise task distances.}
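
The graph construction described above can be written compactly as follows. This is only a sketch: it assumes ordinary least squares for the per-task regression, the symbols $\hat{\mathbf{w}}_t$ and $\delta_{st}$ are introduced here for illustration, and the description above does not specify how the $k$-nearest neighbor relation is symmetrized or whether the resulting edges are weighted.
\begin{align*}
\hat{\mathbf{w}}_t &= \arg\min_{\mathbf{w}}\sum_{i=1}^{N_t}\left(\mathbf{w}^T\mathbf{x}_t^i-y_t^i\right)^2,\qquad t=1,\cdots,T,\\
\delta_{st} &= \|\hat{\mathbf{w}}_s-\hat{\mathbf{w}}_t\|_2,\qquad
\mathcal{N}_s = \{\text{the } k \text{ tasks } t\neq s \text{ with the smallest } \delta_{st}\}.
\end{align*}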


\begin{figure}[ht]
	\centering
	    \begin{subfigure}{.23\textwidth}
	    \centering
        \includegraphics[width=\textwidth]{figures/sparsity_curve_synthetic.pdf}
        \caption{Synthetic data}
    %\vspace{-0.7em}
    %\label{fig:MTL_curves_synthetic}
	\end{subfigure}%
	\begin{subfigure}{.23\textwidth}
	    \centering
        \includegraphics[width=\textwidth]{figures/sparsity_curve_parkinsons.pdf}
        \caption{Parkinson's disease data}
%\vspace{-0.7em}
    %\label{fig:MTL_curves_parkinsons}
	\end{subfigure}	
	\caption{RMSE with respect to the degree of sparsity (defined as $1-|E_s|/|E|$) of the resulting subgraph for all competing methods. The black dashed line indicates the performance without any edge pruning. Our method is able to drop redundant or less-important edges to further reduce the generalization error.}
	\label{fig:MTL_curves}
\end{figure}


We evaluate the test performance with the root mean squared error (RMSE) and show in Fig.~\ref{fig:MTL_curves} the RMSE of different methods with respect to different edge pruning ratios (i.e., $1-|E_s|/|E|$).
On the synthetic data, only Graph-PRI and LS are able to further reduce the test error. For Graph-PRI, this phenomenon occurs at the beginning of edge pruning, which indicates that our method removes less-informative or spurious connections at an early stage. On the Parkinson's data, most methods perform similarly to ``no edge pruning'' (with Graph-PRI performing slightly better, as shown in the zoomed plot), which suggests the existence of redundant task relationships. Note that the performance of Graph-PRI degrades if we remove a large number of edges. One possible reason is that when $|E_s|$ is small, our subgraph tends to have a high graph centrality or star shape, such that one task dominates. Note, however, that keeping a very sparse relationship is usually not the goal in MTL, because it may lead to weak collaboration between tasks, which violates the motivation of MTL.

% \begin{figure}[ht]
% \centering
% \includegraphics[width=0.45\textwidth]{figures/sparsity_curve_synthetic.pdf}
% \caption{Synthetic data}
% %\vspace{-0.7em}
% \label{fig:MTL_curves_synthetic}
% \end{figure}

% \begin{figure}[ht]
% \centering
% \includegraphics[width=0.45\textwidth]{figures/sparsity_curve_parkinsons.pdf}
% \caption{Parkinsons's disease data}
% %\vspace{-0.7em}
% \label{fig:MTL_curves_parkinsons}
% \end{figure}





%This data set is to estimate examination scores of $15,362$ students from $139$ secondary schools in London from $1985$ to $1987$ where each school is treated as a task. The input consists of four school-specific and three student-specific attributes. We first build a dense $k$-NN ($k=30$) graph with approximately $2200$ edges. The initial rooted mean square error (RMSE) is $10.119$. We then sparsify the dense graph with three competing methods, namely PRI, effectiveness resistance~\cite{spielman2011graph} and random sampling, and re-perform multi-task learning over the sparsified graph. The RMSE values of different methodologies with respect to different reserved number of edges are summarized in Table~\ref{Tab:synthetic}. Obviously, PRI achieves the best performance.



% \begin{table}[!hbpt]
% \centering
% \caption{RMSE (mean) on school data set over $10$ independent runs. The best performance is marked in bold.}\label{Tab:synthetic}
% \begin{tabular}{ccccccc}
% \toprule
%  & $\sim 700$ & $\sim 900$ & $\sim 1500$ \\
% \midrule
% Effective Resistance & $10.176$ &	$10.141$	& $10.121$ \\
% Random Sampling & $10.167$ &	$10.151$ &	$10.105$ \\
% PRI & $\mathbf{10.164}$	& $\mathbf{10.139}$ &	$\mathbf{10.084}$ \\
% \bottomrule
% \end{tabular}
% \end{table}

%\commentR{Suggestion:}

% Medical imaging-based  ({\lowercase{f}MRI})  

\subsection{{\lowercase{f}MRI}-derived Brain Network Classification and Interpretability}\label{sec:brain}

Brain networks are complex graphs with anatomic brain regions of interest (ROIs) represented as nodes and functional connectivity (FC) between brain ROIs as links. For resting-state functional magnetic resonance imaging (rs-fMRI), the Pearson correlation coefficient between the blood-oxygen-level-dependent (BOLD) signals associated with each pair of ROIs is the most popular way to construct the FC network~\citep{farahani2019application}.

In the problem of brain network classification, the identification of predictive subnetworks or edges is perhaps one of the most important tasks, as it offers a mechanistic understanding of neuroscience phenomena~\citep{wang2021learning}. Traditionally, this is achieved by treating all the connections (i.e., the Pearson correlation coefficients) of the FC network as a long feature vector and applying feature selection techniques, such as LASSO~\citep{tibshirani1996regression} and the two-sample t-test, to determine whether an edge connection differs significantly between groups (e.g., patients with Alzheimer's disease (AD) versus normal control subjects).

%Recently, the graph neural networks (GNNs) have gained increasing attention for brain network analysis owing to their powerful representation ability to capture the sophisticated brain network structures~\cite{kim2020understanding,li2021braingnn}. Despite improved performance in classification accuracy, GNN itself suffers from poor interpretability or transparency in its decision-making process.

In this section, we develop a new graph neural network (GNN) framework for interpretable brain network classification that infers brain network categories and identifies the most informative edge connections in a joint end-to-end manner. We follow the motivation of~\citep{cui2021brainnnexplainer} and aim to learn a globally shared edge mask $M$ to highlight decision-specific prominent brain network connections. The final explanation for an input graph $G_i$ is generated by the element-wise product of $A_i$ and $\sigma(M)$, i.e., $A_i\odot\sigma(M)$, in which $A_i$ is the adjacency matrix of $G_i$ and $\sigma$ refers to the sigmoid activation function that maps $M$ to $[0,1]^{N\times N}$.
Note that $\sigma(M)$ in our GNN plays a role similar to that of the edge selection vector $\mathbf{w}$ in Graph-PRI.
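
To illustrate how the mask acts entry-wise (a toy illustration with a hypothetical edge weight $a$, not taken from the experiments), recall that $\sigma(m)=1/(1+e^{-m})$. For a single edge with weight $[A_i]_{jk}=a$, we have
\begin{equation*}
    [A_i\odot\sigma(M)]_{jk} = \frac{a}{1+e^{-M_{jk}}},
\end{equation*}
which tends to $a$ as $M_{jk}\to+\infty$, equals $a/2$ at $M_{jk}=0$, and tends to $0$ as $M_{jk}\to-\infty$. Large positive mask entries therefore keep an edge essentially intact, while large negative entries effectively prune it.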

% Similar to the edge selection vector $\mathbf{w}$ in Graph-PRI, 

%Note that, a significant difference between our framework and the BrainNNExplainer~\citep{cui2021brainnnexplainer} is that we infer network label $y$ and learn global mask $M$ simultaneously in a joint framework, whereas BrainNNExplainer performs two tasks separately. Meanwhile, the training objectives are also totally different. 

\noindent
\textbf{Problem definition.}
Given a weighted brain network $G=(V,E,W)$, where $V=\{v_i\}_{i=1}^N$ is the node set of size $N$ defined by the ROIs, $E$ is the edge set, and $W\in\mathbb{R}^{N\times N}$ is the weighted adjacency matrix describing FC strengths between ROIs, the model outputs a prediction label $y$. In brain network analysis, $N$ remains the same across subjects.

\noindent
\textbf{Experimental data.}
We evaluate our method on two real-world benchmark brain network datasets. The first one is the eyes open and eyes closed (EOEC) dataset~\citep{zhou2020toolbox}, which includes $96$ brain networks, with the goal of predicting either the eyes-open or the eyes-closed state. The second one is from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database\footnote{\url{http://adni.loni.usc.edu/}}. We use the brain networks generated by~\citep{kuang2019concise}, with the task of distinguishing the mild cognitive impairment (MCI)\footnote{MCI is a transitional stage between AD and NC.} group ($38$ patients) from normal control (NC) subjects ($37$ in total).
Details on brain network construction are elaborated in Appendix~C.




\noindent
\textbf{Methodology and objective.}
Following~\citep{cui2021brainnnexplainer}, we provide interpretability by learning an edge mask $M\in\mathbb{R}^{N\times N}$ that is shared across all subjects to highlight the disease-specific prominent ROI connections. Motivated by the ability of PRI to prune redundant or less informative edges, as demonstrated in previous sections, we train $M$ such that the resulting subgraph $G'=G\odot\sigma(M)$ and the original graph $G$ meet the PRI constraint, i.e., Eq.~(\ref{eq:pri_graph}). Therefore, the final objective of our interpretable GNN can be formulated as:
\begin{equation}
    \mathcal{L}_{\text{CE}} + \lambda \E_{G \sim p(G)}\left\{S_{\text{vN}}(G') + \beta D_{\text{QJS}}(G'||G) \right\},
\end{equation}
in which $\mathcal{L}_{\text{CE}}$ refers to the supervised cross-entropy loss for label prediction, and $\lambda$ is the hyperparameter that balances the trade-off between $\mathcal{L}_{\text{CE}}$ and the PRI constraint.
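
For concreteness, the two PRI terms can be unpacked using the standard definitions of the von Neumann entropy and the quantum Jensen-Shannon divergence. The sketch below assumes that a graph is represented by its trace-normalized Laplacian; the symbols $L_G$ and $\rho_G$ are shorthand introduced here for illustration, and the precise form given in Eq.~(\ref{eq:pri_graph}) takes precedence.
\begin{align*}
\rho_G &= \frac{L_G}{\mathrm{tr}(L_G)},\qquad
S_{\text{vN}}(G) = -\mathrm{tr}\left(\rho_G\log\rho_G\right) = -\sum_{i=1}^N\lambda_i\log\lambda_i,\\
D_{\text{QJS}}(G'\|G) &= S_{\text{vN}}\left(\frac{\rho_{G'}+\rho_G}{2}\right)-\frac{1}{2}S_{\text{vN}}(\rho_{G'})-\frac{1}{2}S_{\text{vN}}(\rho_G),
\end{align*}
where $\{\lambda_i\}_{i=1}^N$ are the eigenvalues of $\rho_G$ (with the convention $0\log 0=0$), and $S_{\text{vN}}$ applied to a density matrix denotes the same functional $-\mathrm{tr}(\rho\log\rho)$.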

% \underbrace{\left\{S(G’) + \beta D(G’||G) \right\}}_\text{PRI}

\noindent
\textbf{Empirical results.}
We summarize the classification accuracy ($\%$) of different methods over $10$ independent runs in Table~\ref{tab:table_brain}, in which Graph-PRI* refers to our objective implemented by approximating the von Neumann entropy with the Shannon discrete entropy functional on the normalized node degrees (see Section~\ref{sec:approximation}). As can be seen, our method achieves competitive or higher accuracy on both datasets.


To evaluate the interpretability of our method, we visualize the edges that are frequently selected for MCI patients and the NC group in Fig.~\ref{fig:brain}. We observe that the interactions within the sensorimotor cortex (colored \textcolor{blue}{blue}) are stronger for MCI patients than for the NC group.
This result is consistent with the findings in~\citep{ferreri2016sensorimotor,niskanen2011new}, which observed that motor cortex excitability is enhanced in AD and MCI from the early stages.
We also observe that the interactions within the frontoparietal network (colored \textcolor{myyellow}{yellow}) of patients are significantly reduced compared with those of the NC group, which is in line with previous studies~\citep{neufang2011disconnection,zanchi2017decreased} reporting that decreased activation in the FPN is associated with subtle cognitive deficits.


\begin{table}%\small
    \centering
    \caption{Classification accuracy ($\%$) and standard deviation with different methods over $10$ independent runs. The best and second-best performances are in bold and underlined, respectively.}\label{tab:table_brain}
    \setlength{\tabcolsep}{1mm}{
      \begin{tabular}{@{}ccc@{}}
      \toprule % from booktabs package
      \bfseries Method  & \bfseries EOEC & \bfseries ADNI \\
      \midrule % from booktabs package
        SVM + t-test         & $71.79 \pm 7.80$  & $60.61 \pm 10.52$ \\
        SVM + LASSO          & $72.08 \pm 7.29$ & $54.67 \pm 12.88$ \\
      \midrule
        GCN~\citep{kipf2017semi}    & $68.42 \pm 8.59$ & $\mathbf{66.67} \pm 2.48$ \\
        GAT~\citep{velivckovic2018graph} & $73.68 \pm 8.60$ & $\mathbf{66.67} \pm 9.43$ \\
        %GIN~\citep{xu2019powerful} \\
      \midrule
        \textbf{Graph-PRI}   &  $\mathbf{80.70} \pm 9.60$     &  $\mathbf{66.67} \pm 6.67$\\ 
        \textbf{Graph-PRI*}   &  $\underline{78.95} \pm 4.30$     &  $\underline{64.44} \pm 3.14$\\
      \bottomrule % from booktabs package
    \end{tabular}
    }
\end{table}



\definecolor{braincolor1}{rgb}{0,0,1}
\definecolor{braincolor2}{rgb}{0.8500,0.3250,0.0980}
\definecolor{braincolor3}{rgb}{0.9290,0.6940,0.1250}
\definecolor{braincolor4}{rgb}{0.4940,0.1840,0.5560}
\definecolor{braincolor5}{rgb}{0.4660,0.6740,0.1880}
\definecolor{braincolor6}{rgb}{0.3010,0.7450,0.9330}


\begin{figure}%[t!]
	\centering
	\begin{subfigure}{.45\textwidth}
		\centering
		\includegraphics[width=\textwidth]{MCI_circular.pdf}
		\caption{MCI patients}
	%	\label{fig:openflights_full}
	\end{subfigure}%
	\\
	\begin{subfigure}{.45\textwidth}
		\centering
		\includegraphics[width=\textwidth]{NC_circular.pdf}
		\caption{Normal control group}
	%	\label{fig:openflights_sparse}
	\end{subfigure}
	\\
	\begin{subfigure}{.45\textwidth}
	\centering
	\includegraphics[width=\textwidth]{figures/brain_color.pdf}
	%\caption{Normal control group}
	%\label{fig:openflights_sparse}
	\end{subfigure}
	\caption{The contributing functional connectivity links for (a) MCI patients; and (b) normal control group. We visualize edges with a probability of more than $50\%$ of being selected by our learned edge mask.
	The colors of neural systems are described as: sensorimotor network (\textcolor{braincolor1}{SMN}), occipital network (\textcolor{braincolor2}{ON}), fronto-parietal network (\textcolor{braincolor3}{FPN}), default mode network (\textcolor{braincolor4}{DMN}), cingulo-opercular network (\textcolor{braincolor5}{CON}), and cerebellum network (\textcolor{braincolor6}{CN}), respectively.}
	\label{fig:brain}
\end{figure}

% See Appendix~\ref{sec:appendix_zoom} for a zoomed plot.


% We apply both DGCNN and graph kernel as baseline classifiers and evaluate their performances in 3-5 benchmark graph classification datasets. The datasets are: MUTAG, PTC, NCI1, PRO-TEINS, D\&D.

% %A useful link \url{https://github.com/muhanzhang/DGCNN}

% %Another benchmark dataset from Norway \url{https://github.com/FilippoMB/Benchmark_dataset_for_graph_classification}


% \begin{table}[!hbpt]
%     \centering
%     \small
% \caption{Relative percentage gain or loss $\frac{r' - r}{r} $ in using the sprase graph in graph classification.}\label{Tab:classification}    
%     \begin{tabular}{ccccccc}
%     \toprule
%         Dataset vs $\beta$ & 0.01 & 0.10 & 0.30 & 0.50 & 0.70 & 1.00 \\ \hline
%         DD & -1.39 & 7.41 & 6.02 & 0.00 & -0.93 & -1.39 \\ \hline
%         deezer\_ego\_nets & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00\\ \hline
%         MSRC\_9 & -1.56 & -1.56 & -1.56 & -1.56 & -1.56 & -2.34 \\ \hline
%         NCI1 & 1.00 & 0.50 & 0.50 & 0.50 & 0.50 & 0.00   \\ \hline
%         OHSU & -8.82 & -8.82 & -8.82 & -5.88 & -5.88 & -8.82   \\ \hline
%         PROTEINS & -1.84 & -1.38 & -1.38 & -1.38 & -0.46 & -0.46  \\ 
%         \bottomrule
%     \end{tabular}
% \end{table}


% \begin{table}[!ht]
%     \centering
% \caption{Sparsification percentage vs $\beta$ in graph classification.}\label{Tab:classification}     
%     \begin{tabular}{ccccccc}
%     \toprule
%         Dataset vs $\beta$  & 0.01 & 0.10 & 0.30 & 0.50 & 0.70 & 1.00  \\ \hline
%         DD & 0.20 & 0.56 & 0.92 & 0.98 & 1.00 & 1.00   \\ \hline
%         deezer\_ego\_nets & 0.16 & 0.18 & 0.25 & 0.33 & 0.43 & 0.58  \\ \hline
%         MSRC\_9 & 0.18 & 0.19 & 0.27 & 0.47 & 0.74 & 0.96   \\ \hline
%         NCI1 & 0.15 & 0.16 & 0.21 & 0.34 & 0.56 & 0.87 \\ \hline
%         OHSU & 0.23 & 0.26 & 0.45 & 0.73 & 0.84 & 0.91  \\ \hline
%         PROTEINS & 0.18 & 0.21 & 0.31 & 0.44 & 0.60 & 0.79 \\ \bottomrule
%     \end{tabular}
% \end{table}


\section{Conclusions}
%We present a preliminary investigation on extending the principle of relevant information to graphs. By introducing a hyper-parameter $\beta$, the graph Laplacian-based PRI is able to sparsify a graph to different extent such that users can flexibly choose a suitable trade-off between the simplicity of the resulting graph and its nearness to the original graph. Properties on specific values of $\beta$ is provided. Experiments suggest that PRI offers a promising avenue to improve the generation error of graph regularization based machine learning algorithms and, at the same time, reduce model complexity.

We present a first study on extending the Principle of Relevant Information (PRI), a less well-known but promising unsupervised information-theoretic principle, to network analysis and graph neural networks (GNNs). Our Graph-PRI preserves spectral similarity well, while also encouraging the resulting subgraph to have higher graph centrality. Moreover, Graph-PRI is easy to optimize and can be flexibly integrated with either multi-task learning or GNNs to improve not only quantitative accuracy but also interpretability.

%Operated on the graph Laplacian and its associated incidence matrix, our Graph-PRI is able to learn a sparse graph by balancing the regularity of the resulting subgraph and its closeness to the original one, with a hyperparameter $\beta$. Moreover, it can be simplify optimized by Gumbel-softmax in terms of an edge selection vector. The effectiveness and versatility of Graph-PRI is manifested in two real-world applications involving respectively the multitask regression learning and the interpretable brain network classification.

In the future, we will explore further properties of Graph-PRI, including a full understanding of the physical meaning of the von Neumann entropy on graphs. We will also investigate more downstream applications of Graph-PRI to GNNs, such as node representation learning.

%All in all, we believe our investigations in this paper establish a new bridge between information theory and graphs, and provide new insights to a few down-stream applications in the future. 

\begin{acknowledgements} % will be removed in pdf for initial submission,
                         % so you can already fill it to test with the
                         % ‘accepted’ class option
    The authors would like to thank the anonymous reviewers for their constructive comments. The authors would also like to thank Prof. Benjamin Ricaud (at UiT) and Mr. Kaizhong Zheng (at Xi'an Jiaotong University) for helpful discussions. This work was funded in part by the Research Council of Norway (RCN) under grant 309439, the U.S. ONR under grants N00014-18-1-2306, N00014-21-1-2324, and N00014-21-1-2295, and DARPA under grant FA9453-18-1-0039.
\end{acknowledgements}

%\input{UAI_camera_ready.bbl}
%\bibliographystyle{named}
%{\fontsize{8}{9}\selectfont \bibliography{IT}}
\bibliography{yu_641}


\end{document}
