
\section{Background and Problem Formulation}
In this section, we review the basics of graphs and graph
coarsening.

\subsection{Graph}
A graph with features and labels is represented as $\mathcal{G} = (V, E, A, X, Y)$, where $V = \{v^1, v^2, \dots, v^p\}$ denotes the vertex set, $E \subseteq V \times V$ is the edge set, and $A \in \mathbb{R}_+^{p\times p}$ is the adjacency (weight) matrix of a graph with $p$ nodes. Each nonzero entry $A_{ij}$ is the weight of the edge between the $i^{th}$ and $j^{th}$ nodes. Furthermore, $X = [\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_p]^\top \in \mathbb{R}^{p \times n}$ is the feature matrix, where each row vector $\mathbf{x}_i \in \mathbb{R}^{n}$ is the feature vector associated with the $i^{th}$ node of $\mathcal{G}$. Moreover, in semi-supervised learning, label information is given by \(Y \in \{0, 1\}^{p \times l}\): if node \(v^i\) is labelled, the $i^{th}$ row \(\mathbf{y}_i\) of \(Y\) is the corresponding one-hot indicator vector; otherwise, \(\mathbf{y}_i = \mathbf{0}\). In general, a graph is represented by either an adjacency matrix or a Laplacian matrix. A matrix \(\Theta \in \mathbb{R}^{p \times p}\) is a combinatorial Laplacian matrix if it belongs to the following set \citep{kumar2020unified}:

\begin{align}\label{Lap-set}
\hspace{-1em}\mathcal{S}_{\Theta} =\big\{\Theta \in \mathbb{R}^{p \times p} \,\big|\, \Theta_{ij} =\Theta_{ji} \leq 0 \ \text{for} \ i\neq j;\ \Theta_{ii}=-\sum_{j\neq i}\Theta_{ij} \big\}.
\end{align}

The adjacency matrix \(A\) and the combinatorial Laplacian matrix \(\Theta\) are related by \(A_{ij} = -\Theta_{ij}\) for all \(i \neq j\), and \(A_{ii} = 0\). The Laplacian matrix \(\Theta\) is often preferred over the adjacency matrix \(A\) because it is symmetric, positive semidefinite, and has zero row sums. In the next subsection, we discuss graph learning from data.
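
As a small illustration (the graph and its weights are chosen here only as an example and are not taken from any dataset), consider a path graph on $p=3$ nodes with edge weights $A_{12}=2$ and $A_{23}=1$. The adjacency and Laplacian matrices are
\begin{equation*}
A=\begin{bmatrix} 0 & 2 & 0\\ 2 & 0 & 1\\ 0 & 1 & 0\end{bmatrix},
\qquad
\Theta=\begin{bmatrix} 2 & -2 & 0\\ -2 & 3 & -1\\ 0 & -1 & 1\end{bmatrix}.
\end{equation*}
Every row of $\Theta$ sums to zero, and for any $\mathbf{z}\in\mathbb{R}^3$,
\begin{equation*}
\mathbf{z}^\top\Theta\,\mathbf{z} = 2(z_1-z_2)^2+(z_2-z_3)^2 \geq 0,
\end{equation*}
which confirms positive semidefiniteness in this instance.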

\subsection{Graph learning from data}
Given the data $X=[\mathbf{x}_1, \dots, \mathbf{x}_p]^\top$, a connected and smooth graph can be obtained by solving the following optimization problem \citep{pmlr-v51-kalofolias16}:

\begin{gather}\label{GMRF}
  \min_{\Theta \in \mathcal{S}_{\Theta}} - \gamma \log(\det(\Theta + J))+\text{tr}(X^T\Theta X)+ \beta h(\Theta) 
\end{gather}

where $\Theta \in \mathbb{R}^{p \times p}$ is the target Laplacian matrix and $\mathcal{S}_{\Theta}$ is the set of Laplacian matrices defined in \eqref{Lap-set}. The term $\text{tr}(X^T\Theta X)$ measures the smoothness (energy) of the graph; minimizing it encourages larger edge weights between nodes with similar features. The regularizer $h(\Theta)$, weighted by the hyperparameter $\beta$, enforces desired properties such as sparsity in the learned graph. Ensuring that the graph is connected requires the rank of $\Theta$ to be $p-1$. This is achieved through the term $- \gamma \log(\det(\Theta + J))$, where $J=\frac{1}{p}\mathbf{1}_{p \times p}$ is a rank-1 matrix with every entry equal to $\frac{1}{p}$. Adding $J$ to $\Theta$ yields a full-rank matrix without altering the row and column spaces of $\Theta$. In the next subsection, we discuss graph coarsening.
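
To see why these two terms behave as described, the following short derivation may help. It uses only $\Theta \in \mathcal{S}_{\Theta}$ and the relation $A_{ij}=-\Theta_{ij}$ for $i\neq j$; the eigenvalue notation $\lambda_i$ and the assumption $\gamma>0$ are introduced here for illustration. Since $\Theta_{ii}=\sum_{j\neq i}A_{ij}$, the smoothness term expands as
\begin{equation*}
\text{tr}(X^T\Theta X)=\sum_{i<j} A_{ij}\,\|\mathbf{x}_i-\mathbf{x}_j\|_2^2,
\end{equation*}
so a large weight $A_{ij}$ is inexpensive only when $\mathbf{x}_i$ and $\mathbf{x}_j$ are close. For the log-determinant term, suppose the graph is connected and let $0=\lambda_1<\lambda_2\leq\dots\leq\lambda_p$ be the eigenvalues of $\Theta$. Because $\Theta\mathbf{1}=\mathbf{0}$ and $J$ is the orthogonal projector onto the span of $\mathbf{1}$, the matrix $\Theta+J$ shares the eigenvectors of $\Theta$ and has eigenvalues $1,\lambda_2,\dots,\lambda_p$. Hence
\begin{equation*}
-\gamma\log\det(\Theta+J)=-\gamma\sum_{i=2}^{p}\log\lambda_i,
\end{equation*}
which, for $\gamma>0$ and bounded edge weights, grows without bound as $\lambda_2\to 0$, that is, as the graph approaches disconnection.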










\subsection{Graph Coarsening}
The objective of graph coarsening is to learn a smaller, more tractable graph $\mathcal{G}_c(\Theta_c, \tilde{X}, \tilde{Y})$ that preserves the properties of the original graph $\mathcal{G}(\Theta, X, Y)$. Here, $\Theta_c \in \mathbb{R}^{k \times k}$ is the Laplacian matrix, $\tilde{X} \in \mathbb{R}^{k \times n}$ is the feature matrix, and $\tilde{Y} \in \mathbb{R}^{k \times l}$ is the label matrix of the coarsened graph. The relations between $\Theta$ and $\Theta_c$, $X$ and $\tilde{X}$, and $Y$ and $\tilde{Y}$ are given by
\begin{equation}
    \Theta_c=C^T\Theta C, \hspace{0.25cm} X=C\tilde{X},\hspace{0.25cm} \tilde{Y}=\text{argmax}(C^{\dagger}Y)
\end{equation}
where $C \in \mathbb{R}_+^{p \times k}$ is the mapping matrix that maps the $p$ nodes of the original graph to the $k$ nodes of the coarsened graph, and $C^{\dagger}$ denotes its pseudo-inverse. Each nonzero entry $C_{ij}$ indicates that the $i^{th}$ node of the original graph is mapped to the $j^{th}$ supernode of the coarsened graph. For a balanced mapping, the mapping matrix must belong to the following set, where $C_i$ and $[C^{\top}]_i$ denote the $i^{th}$ column and the $i^{th}$ row of $C$, respectively:
\begin{gather}
\mathcal{C} =\Big\{ C\geq 0 \;\Big|\; \langle C_i, C_j \rangle=0 \ \forall \; i\neq j,
 \quad \langle C_i, C_i \rangle=d_i,\nonumber\\ \norm{C_i}_0\geq 1 \ \text{and} \ \norm{[C^{\top}]_i}_0= 1 \Big\} \label{mappingmatrix-set1}
\end{gather}
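
As an illustrative sketch (the graph, the binary choice of the entries of $C$, and the row-wise reading of $\text{argmax}$ are assumptions made only for this example), take the unweighted path graph on $p=4$ nodes and merge nodes $\{1,2\}$ and $\{3,4\}$ into $k=2$ supernodes:
\begin{equation*}
\Theta=\begin{bmatrix} 1 & -1 & 0 & 0\\ -1 & 2 & -1 & 0\\ 0 & -1 & 2 & -1\\ 0 & 0 & -1 & 1\end{bmatrix},
\qquad
C=\begin{bmatrix} 1 & 0\\ 1 & 0\\ 0 & 1\\ 0 & 1\end{bmatrix}.
\end{equation*}
The two columns of $C$ are orthogonal, each has squared norm $2$ and at least one nonzero entry, and every row of $C$ has exactly one nonzero entry, so $C\in\mathcal{C}$ with $d_1=d_2=2$. The coarsened Laplacian is
\begin{equation*}
\Theta_c=C^T\Theta C=\begin{bmatrix} 1 & -1\\ -1 & 1\end{bmatrix},
\end{equation*}
a valid Laplacian whose single edge carries the weight of the original edge between nodes $2$ and $3$, the only edge crossing the two groups. Since $C^TC=2I$, the pseudo-inverse is $C^{\dagger}=\tfrac{1}{2}C^T$, so $C^{\dagger}Y$ averages the label rows within each supernode. If nodes $1$, $3$, and $4$ carry the one-hot labels $(1,0)$, $(0,1)$, and $(0,1)$, and node $2$ is unlabelled, then
\begin{equation*}
C^{\dagger}Y=\begin{bmatrix} \tfrac{1}{2} & 0\\ 0 & 1\end{bmatrix},
\end{equation*}
and a row-wise $\text{argmax}$ assigns the first class to supernode $1$ and the second class to supernode $2$.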

\noindent{\textbf{Problem Statement:} 
Given an original graph \(\mathcal{G}(\Theta, X, Y)\), our objective is to learn a  coarsened graph \(\mathcal{G}_c(\Theta_c, \tilde{X}, \tilde{Y})\).}


Several graph coarsening techniques have been developed for learning the mapping matrix $C$. The heuristic method proposed by \citet{loukas2019graph} relies solely on the Laplacian matrix $\Theta$ to derive the mapping matrix $C$. In contrast, the method of \citet{jin2021graph} is a deep-learning-based approach that leverages graph neural networks to learn a condensed graph. A more recent optimization-based method, introduced by \citet{kumar2023unified}, considers not only the Laplacian matrix $\Theta$ but also the feature matrix $X$ when learning the mapping matrix $C$.

In semi-supervised learning, where some node labels are available, existing state-of-the-art methods often neglect the label information during the coarsening process, that is, while learning the mapping matrix \(C\). Ignoring the label matrix may yield a less informative coarsened graph that is poorly suited to downstream tasks, which undermines the purpose of coarsening. In response, we introduce the first framework that takes the feature matrix \(X\), the partially observed label matrix \(Y\), and the Laplacian matrix \(\Theta\) of the original graph as inputs for learning a coarsened graph. The proposed formulation is:
\vspace{-0.5cm}
\begin{gather}\label{FGC formulation}
  \min_{\Theta_c, \tilde{X}, {C}}  f(\Theta_c, \tilde{X}) + \beta h(\Theta_c)+ \frac{\lambda}{2} g(C)  + r(C, Y)\\ \nonumber
 \text{s.t.}\;\; \ C \geq 0, \ \Theta_c=C^T\Theta C, \ X= C\tilde{X}, \ \Theta_c \in \mathcal{S}_{\Theta}, C \in \mathcal{C}
\end{gather}
where $f(\Theta_c, \tilde{X})$ is a graph fitting term; for example, $f(\Theta_c, \tilde{X})=\text{tr}(\tilde{X}^T \Theta_c \tilde{X})$ measures the smoothness of the coarsened graph. The regularizer $h(\Theta_c)$ is applied to the coarsened Laplacian matrix to enforce desired structural properties. For instance, $h(\Theta_c)=\|C^T\Theta C\|_F^2$ is employed to promote sparsity in the resulting coarsened graph. Additionally, the regularizer $g(C)$ is imposed on the mapping matrix $C$ to enforce the properties of the set $\mathcal{C}$ defined in \eqref{mappingmatrix-set1}.
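
A brief consistency check, offered as a sketch rather than as part of the formulation itself: when the constraints $\Theta_c=C^T\Theta C$ and $X=C\tilde{X}$ hold exactly, the smoothness of the coarsened graph can be rewritten in terms of the original graph,
\begin{equation*}
\text{tr}(\tilde{X}^T\Theta_c\tilde{X})
=\text{tr}(\tilde{X}^T C^T\Theta C\tilde{X})
=\text{tr}\big((C\tilde{X})^T\Theta\,(C\tilde{X})\big)
=\text{tr}(X^T\Theta X),
\end{equation*}
so a feature matrix that is smooth on the original graph remains smooth on the coarsened graph. If the relation $X=C\tilde{X}$ is satisfied only approximately, the last equality holds only approximately.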

The function \(r(C, Y)\) plays a pivotal role in our approach: it couples the label matrix \(Y\) of the original graph with the mapping matrix \(C\) so that nodes with similar labels in the original graph are mapped to the same supernode in the coarsened graph. The choice of \(r(C, Y)\) directly influences the quality of the coarsened graph, and determining an appropriate function is challenging. In the next subsection, we develop the algorithm.


% Graph coarsening serves a pivotal purpose in expediting downstream tasks by providing a more concise and efficient representation of the graph. The core motivation is to replace the use of the original graph with its coarsened counterpart, leading to a substantial reduction in training time and storage space. The coarsened graph retains essential properties of the original graph.  In essence, graph coarsening acts as a computational shortcut, enabling more expedient training and execution of algorithms in subsequent tasks processing in downstream tasks.

% For a given graph $\mathcal{G}$, there are various possibilities of the coarsened graph. To quantify the quality of the coarsened graph to perform the downstream task like node classification, a node profile matrix is defined by \cite{ghoroghchian2021graph}.