\section{Algorithm Development}
Before developing the algorithm, we introduce the node profile matrix \citep{ghoroghchian2021graph}.
\subsection{Node Profile Matrix} \label{nodelabelmatrix}
Given an original graph $\mathcal{G}(\Theta, X, Y)$, the objective of graph coarsening is to learn a coarsened graph $\mathcal{G}_c$ by learning the mapping matrix $C$. The quality of the coarsened graph is often assessed using the node profile matrix.

The node profile matrix of a coarsened graph, denoted by $\phi$, is defined as
\begin{equation}
    \phi = C^TY
\end{equation}
Here, \(Y \in \mathbb{R}_+^{p \times l}\) is the one-hot label matrix of the original graph. Each non-zero entry \(\phi_{ij}\) is the number of nodes of the original graph with the \(j^{th}\) label that are mapped to the \(i^{th}\) supernode of the coarsened graph \(\mathcal{G}_c\). A balanced mapping is characterized by sparse rows of \(\phi\), which indicates that nodes with the same label in the original graph are mapped to the same supernode of the coarsened graph.
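
As a small worked illustration of $\phi = C^TY$, consider a hypothetical graph with $p=4$ nodes, $l=2$ labels and $k=2$ supernodes. The numbers below, and the use of a binary (unnormalised) mapping matrix $C \in \mathbb{R}^{4 \times 2}$, are assumptions made only for this example. Nodes $1,2$ are mapped to the first supernode and nodes $3,4$ to the second; nodes $1,2,4$ carry the first label and node $3$ the second:
\[
C^T =
\begin{bmatrix}
1 & 1 & 0 & 0 \\
0 & 0 & 1 & 1
\end{bmatrix},
\quad
Y =
\begin{bmatrix}
1 & 0 \\
1 & 0 \\
0 & 1 \\
1 & 0
\end{bmatrix},
\quad
\phi = C^TY =
\begin{bmatrix}
2 & 0 \\
1 & 1
\end{bmatrix}.
\]
The first row of $\phi$ has a single non-zero entry, so the first supernode contains nodes of one label only, whereas the second row has two non-zero entries because the second supernode mixes both labels.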






\begin{figure}[h!]
    \centering
    \includegraphics{manoj-Page-3.png}
    \caption{For a given original graph \(\mathcal{G}\), many coarsened graphs are possible. The coarsened graph \(\mathcal{G}_{c2}\) is the more informative representation because nodes with similar labels are mapped to the same supernode, which makes it well suited for downstream tasks.}
    \label{toyexample}
\end{figure}
Consider the toy example in Figure \ref{toyexample}, which shows an original graph $\mathcal{G}$ and two coarsened graphs with the associated $\phi$ matrices:
$$
\phi_1 =
\begin{bmatrix}
  2& 1 & 0  \\
 1& 1 & 0   \\
 0& 1 & 0  \\
 1& 0 & 3  \\
 0& 0 & 1 
\end{bmatrix}
\hspace{0.5cm}\phi_2 =
\begin{bmatrix}
 3 & 0 & 0 \\
 2 & 0 & 0\\ 
 0 & 3 & 0 \\
 0 & 0 & 3\\
 0 & 0 & 1
\end{bmatrix}
$$

% It is evident that the $\phi_2$ is more sparser than $\phi$ which implies that nodes having similar labels are mapping to the same supernode which further implies that the coarsened graph corresponding to the $\phi_2$ matrix is more informative than the coarsened graph corresponding to the $\phi_1$ matrix and hence more suitable to perform downstream task considering the coarsened graph inplace of original graph.
The matrix $\phi_2$ is sparser than $\phi_1$, which indicates that nodes with similar labels are consistently mapped to the same supernode. The coarsened graph corresponding to $\phi_2$ is therefore a more informative representation of the original graph than the one corresponding to $\phi_1$, and it is better suited for downstream tasks because it captures both the structural and the label-related characteristics of the original graph.
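
The comparison can be made concrete by counting non-zero entries row by row; the notation $\text{nnz}(\cdot)$ is introduced here only for this illustration:
\[
\text{nnz}(\phi_1) = 2+2+1+2+1 = 8, \qquad \text{nnz}(\phi_2) = 1+1+1+1+1 = 5.
\]
Every row of $\phi_2$ has exactly one non-zero entry, so each supernode contains nodes of a single label, while the first, second and fourth rows of $\phi_1$ each mix two labels.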

Existing graph coarsening techniques are limited in their ability to learn coarsened graphs whose $\phi$ matrix is sparse, which hinders their use in downstream tasks such as node classification on the coarsened graph. These methods also overlook the label information of the original graph during coarsening, so the resulting coarsened graphs lack crucial information and perform suboptimally in downstream tasks. This motivates coarsening techniques that use the graph matrix, the feature matrix, and the label information of the original graph, so that the learned coarsened graph is more informative and has a sparse $\phi$ matrix.

In the next section, we introduce the first optimization-based approach that takes the graph matrix, feature matrix, and label matrix of the original graph as inputs and learns a more informative coarsened graph with a sparse $\phi$ matrix. During coarsening, the label information is used in a semi-supervised manner.

\subsection{Proposed Formulation}
Given a graph \(\mathcal{G}(\Theta \in \mathbb{R}^{p \times p}, X \in \mathbb{R}^{p \times n}, Y)\) with label matrix \(Y \in \{0, 1\}^{p \times l}\), the \(i\)-th row \(\textbf{y}_{i:}\) of \(Y\) is the one-hot indicator vector of node \(v^i\) if the node is labelled, and \(\textbf{y}_{i:} = 0\) if it is unlabelled, as in the semi-supervised setting. The proposed formulation for learning a coarsened graph with a sparse \(\phi\) matrix is:
\begin{gather}
  \min_{\Theta_c, \tilde{X}, {C}} -\gamma \text{log det}(\Theta_c +J)+\text{tr}(\tilde{X}^{T}\Theta_c\tilde{X})\label{FGC formulation1}\\ \nonumber  + \beta h(\Theta_c)+ \frac{\lambda}{2} g(C) + r(C, Y)  \\ \nonumber
 \text{s.t.}\;\; \ C \geq 0, \ \Theta_c=C^T\Theta C, \ X= C\tilde{X}, \ \Theta_c \in \mathcal{S}_{\Theta}, C \in \mathcal{C}
\end{gather}
In this work, we choose \(r(C, Y) = \|C^TY\|_F^2\) as the guiding function, which is designed to enforce sparsity in the \(\phi\) matrix of the coarsened graph. The term $-\text{log det}(\Theta_c + J)$ ensures the connectedness of the coarsened graph, where $J=\frac{1}{k}\textbf{1}_{k \times k}$ is a rank-$1$ matrix with every entry equal to $\frac{1}{k}$. The hard constraint $X=C\tilde{X}$ is difficult to handle in optimization, so we relax it to the penalty term $\frac{\alpha}{2}\|C\tilde{X}-X\|_F^2$ and use the regularizers $h(\Theta_c) = \|\Theta_c\|_F^2$ and $g(C) = \|C^T\|_{1,2}^2$. Substituting $\Theta_c=C^T\Theta C$ into \eqref{FGC formulation1} reduces the three-variable optimization problem to a two-variable problem:




\begin{gather}\label{Main formulation}
  \min_{\tilde{X}, {C}} -\gamma \text{log det}(C^T\Theta C +J)+\text{tr}(\tilde{X}^{T}C^T\Theta C\tilde{X}) \\ \nonumber +\frac{\alpha}{2}\|C \tilde{X}-X\|_F^2 +\frac{\lambda}{2} \|C^T\|_{1,2}^2 + \frac{\beta}{2}\|C^T\Theta C\|_F^2 \\ \nonumber + \frac{\delta}{2}\|C^TY\|_F^2 \\ \nonumber
 \text{s.t.}\;\; C \in \mathcal{S}_C =\left\{ C \geq 0 \ \middle| \ \|[C^T]_i\|_2^2 \leq 1 \ \forall \ i=1,\dots, p \right\} \label{Loading matrix-set}
\end{gather}

where the term $\frac{\beta}{2}\|C^T\Theta C\|_F^2$ is included to enforce sparsity in the learned coarsened graph, and the term $\frac{\delta}{2}\|C^TY\|_F^2$ promotes sparsity in the $\phi$ matrix. This sparsity ensures that nodes sharing the same label are consistently mapped to the same supernode.
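
The role of the log-determinant term can be sketched with a short eigenvalue argument. The sketch assumes that $\Theta_c = C^T\Theta C$ is a combinatorial graph Laplacian, i.e., symmetric, positive semi-definite and with $\Theta_c \textbf{1}_k = 0$, where $\textbf{1}_k$ denotes the all-ones vector of length $k$; the eigenvalue notation $\mu_i$ is introduced only for this illustration. Since $J = \frac{1}{k}\textbf{1}_k\textbf{1}_k^T$ has eigenvalue $1$ along $\textbf{1}_k$ and eigenvalue $0$ on its orthogonal complement, while $\Theta_c$ has eigenvalue $0$ along $\textbf{1}_k$, the two matrices share eigenvectors and
\[
\det(\Theta_c + J) = 1 \cdot \prod_{i=2}^{k} \mu_i(\Theta_c), \qquad 0 = \mu_1(\Theta_c) \leq \mu_2(\Theta_c) \leq \dots \leq \mu_k(\Theta_c).
\]
A graph is connected exactly when $\mu_2(\Theta_c) > 0$. Under these assumptions, $-\gamma\,\text{log det}(\Theta_c + J)$ is finite for a connected coarsened graph and equals $+\infty$ for a disconnected one, so minimizing it keeps the coarsened graph connected.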

Problem \eqref{Main formulation} is non-convex when all variables are considered jointly, but it becomes convex in each variable when the other is held fixed. We therefore solve it iteratively within the block successive upper bound minimization (BSUM) framework and develop a block MM-based algorithm that updates one of the variables $(\tilde{X},C)$ at a time while keeping the other fixed.
\vspace{-0.1cm}
\subsection{Update of \textit{C}}
% \subsection{\texorpdfstring{Update of C}}
\vspace{-0.1cm}
When considering $C$ as a variable and holding $\tilde X$ constant, the resulting sub-problem for $C$ can be expressed as follows:
\begin{gather}\label{UpdateC}
 \min_{C \in \mathcal{S}_C}f(C)=-\gamma \text{log det}(C^T\Theta C +J)+ \frac{\lambda}{2} \|C^T\|_{1,2}^2\\ \nonumber + \frac{\alpha}{2} \|C\tilde{X} - X\|^2_F  +\text{tr}(\tilde{X}^{T}C^T\Theta C\tilde{X})+ \frac{\beta}{2}\|C^T\Theta C\|_F^2\\ \nonumber+ \frac{\delta}{2}\|C^TY\|_F^2 \nonumber
 % \text{s.t.}\;\mathcal{S}_C =\left\{ C \geq 0| \ \|[C^T]_i\|_2^2 \leq 1 \ \forall \ i=1,.., p \right\}
\end{gather}
% \begin{align}\label{Loading matrix-set}
% \text{s.t.}\;\mathcal{S}_C =\left\{ C \geq 0| \ \|[C^T]_i\|_2^2 \leq 1 \ \forall \ i=1,.., p \right\}
% \end{align}
The functions $-\gamma \log \det(C^T\Theta C + J)$, $\frac{\lambda}{2} \|C^T\|_{1,2}^2$, $\frac{\alpha}{2} \|C\tilde{X} - X\|_F^2$, and $\text{tr}(\tilde{X}^TC^T\Theta C\tilde{X})$ are all convex \citep{kumar2023unified}. The terms $\frac{\beta}{2}\|C^T\Theta C\|_F^2$ and $\frac{\delta}{2}\|C^TY\|_F^2$ involve squared Frobenius norms and are convex as well. Since $\mathcal{S}_C$ is a closed convex set, the optimization problem \eqref{UpdateC} is strictly convex.

By using the first-order Taylor series approximation, a majorised function for $f(C)$ at $C^{(t)}$ can be obtained as \citep{inbook, article, 7547360}:
\begin{equation}\label{Lipschitzequation}
 g(C|C^{(t)})=f(C^{(t)})+\text{tr}\left(\nabla f(C^{(t)})^T(C-C^{(t)})\right)+\frac{L}{2}\|C-C^{(t)}\|_F^2
\end{equation}
where the gradient of $f(C)$ is $L$-Lipschitz continuous with $L=\max(L_1,L_2, L_3,L_4,L_5,L_6)$, and $L_1,L_2, L_3,L_4,L_5,L_6$ are the Lipschitz constants of the gradients of $-\gamma \text{log det}(C^T\Theta C +J)$, $\text{tr}(\tilde{X}^{T}C^T\Theta C\tilde{X})$, $\|C\tilde{X} - X\|^2_F$, $\|C^T\|_{1,2}^2$, $\frac{\beta}{2}\|C^T\Theta C\|_F^2$, and $\frac{\delta}{2}\|C^TY\|_F^2$, respectively. After ignoring the constant term, the majorised problem of \eqref{UpdateC} is
\begin{align}\label{UpdateC1}
  \underset{C \in \mathcal{S}_C}{\text{minimize}} \quad \frac{1}{2}\text{tr}(C^TC)-\text{tr}(C^TA)
\end{align}
where $A=C^{(t)}-\frac{1}{L}\nabla f(C^{(t)})$ and
\begin{align*}
\nabla f(C^{(t)}) = {} & -2\gamma \Theta C^{(t)} \left((C^{(t)})^{T}\Theta C^{(t)}+J\right)^{-1} \\
& +\alpha \left(C^{(t)}\tilde{X} - X\right)\tilde{X}^{T} +2\Theta C^{(t)}\tilde{X}\tilde{X}^T+ \lambda C^{(t)}\pmb{1} \\
& + 2\beta\,\Theta C^{(t)} (C^{(t)})^{T}\Theta C^{(t)} + \delta\, Y Y^{T} C^{(t)},
\end{align*}
where $\pmb{1}$ is the all-ones matrix of dimension ${k \times k}$.
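
The step from the majorised function \eqref{Lipschitzequation} to problem \eqref{UpdateC1} can be made explicit by completing the square. The sketch below reads the inner product in \eqref{Lipschitzequation} as the trace inner product and the norm as the Frobenius norm, and collects every term that does not depend on $C$ into $\text{const}$:
\begin{align*}
g(C|C^{(t)}) &= \frac{L}{2}\|C-C^{(t)}\|_F^2 + \text{tr}\left(\nabla f(C^{(t)})^T (C-C^{(t)})\right) + f(C^{(t)}) \\
&= \frac{L}{2}\left\|C-\left(C^{(t)}-\frac{1}{L}\nabla f(C^{(t)})\right)\right\|_F^2 + \text{const} \\
&= L\left(\frac{1}{2}\text{tr}(C^TC) - \text{tr}(C^TA)\right) + \text{const}.
\end{align*}
Dividing by $L>0$ and dropping the constant does not change the minimiser, which gives the objective of \eqref{UpdateC1}. Equivalently, the update minimises $\frac{1}{2}\|C-A\|_F^2$ over the constraint set.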

\begin{Lem1}\label{lemma1}
Using the KKT optimality conditions, the optimal solution of \eqref{UpdateC1} is
\begin{equation}\label{eqn:C}
  C^{(t + 1)} = \left(C^{(t)} - \frac{1}{L}\nabla f\left(C^{(t)}\right)\right)^+
  \end{equation}
where $(X_{ij})^{+}=\max(\frac{X_{ij}}{\|[X^{T}]_{i}\|_{2}},0)$ and $[X^{T}]_{i}$ is the $i$-th row of matrix $X$.
\end{Lem1}

{\it Proof:} The proof is deferred to Appendix \ref{prooflemma1}.
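
As a small numerical illustration of the operator $(\cdot)^{+}$ in Lemma \ref{lemma1}, take a hypothetical $2 \times 2$ matrix $M$ whose values are chosen only for this example. Its rows have Euclidean norms $\sqrt{3^2+(-4)^2}=5$ and $2$, so dividing each entry by its row norm and clipping negative values at zero gives
\[
M =
\begin{bmatrix}
3 & -4 \\
0 & 2
\end{bmatrix},
\qquad
M^{+} =
\begin{bmatrix}
\max(3/5,0) & \max(-4/5,0) \\
\max(0/2,0) & \max(2/2,0)
\end{bmatrix}
=
\begin{bmatrix}
0.6 & 0 \\
0 & 1
\end{bmatrix}.
\]
Each row of $M^{+}$ is non-negative and has Euclidean norm at most one ($0.6$ and $1$, respectively), so $M^{+}$ satisfies both conditions that define the constraint set.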

\vspace{-0.1cm}
% \subsection{Update of $\tilde{X}$}
\subsection{\texorpdfstring{Update of $\tilde{X}$}{Update of X}}

\vspace{-0.1cm}
Fixing $C$, we obtain the following problem for $\tilde{X}$: 
\begin{equation}\label{problemXtilde}
  \min_{\tilde{X}} f(\tilde{X})= \text{tr}(\tilde{X}^TC^T\Theta C \tilde{X}) + \frac \alpha 2 \|C\tilde{X} - X\|^2_F
\end{equation}
Problem \eqref{problemXtilde} is strongly convex because $C^T\Theta C$ is positive semi-definite and $C^TC$ is positive definite. Setting the gradient to zero, i.e., $2C^T\Theta C \tilde{X} + \alpha C^T(C\tilde{X} - X) = 0$, gives the closed-form solution
\begin{equation}\label{eqn:X}
  \tilde{X}^{(t+1)} =\left(\frac{2}{\alpha} C^T\Theta C + C^TC\right)^{-1} C^TX
\end{equation}
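
The rearrangement behind \eqref{eqn:X} can be spelled out in three steps, assuming $\alpha > 0$ and that the matrix to be inverted is non-singular (which holds when $C^TC$ is positive definite, as stated above):
\begin{align*}
2C^T\Theta C \tilde{X} + \alpha C^T(C\tilde{X} - X) &= 0, \\
\left(2C^T\Theta C + \alpha C^TC\right)\tilde{X} &= \alpha C^TX, \\
\left(\frac{2}{\alpha} C^T\Theta C + C^TC\right)\tilde{X} &= C^TX.
\end{align*}
Left-multiplying the last line by the inverse of the bracketed matrix yields the update. The matrix is the sum of a positive semi-definite and a positive definite matrix, hence positive definite and invertible.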

\begin{algorithm}
\SetAlgoLined
\SetAlCapFnt{\footnotesize}
\SetAlCapNameFnt{\footnotesize}
 \caption{\textsf{LAGC Algorithm}} 
 \label{Algorithm}
 \KwIn{$\mathcal{G}(X,Y,\Theta), \alpha, \gamma, \lambda, \beta, \delta$}
 $t \leftarrow 0;$ \\
		\While {stopping criteria not met}{
     Update $C^{(t+1)}$ and $\tilde{X}^{(t+1)}$ as in \eqref{eqn:C} and \eqref{eqn:X}, respectively.\\
  $t \leftarrow t+1;$ \\
 }
  \KwOut{$C$, $\Theta_c$, and $\tilde{X}$}
\end{algorithm}
\begin{figure*}
    \centering
    \includegraphics[scale=2.1]{IMG_LAGC_FRAMEWORK.png}
    \caption{
The diagram illustrates the steps of node classification using a coarsened graph. Given an original graph \(\mathcal{G}(\Theta, X, Y)\) in which some of the node labels are known, we employ the LAGC algorithm to learn a coarsened graph with a sparser \(\phi\) matrix. The resulting coarsened graph is denoted by \(\mathcal{G}_c(\Theta_c, \tilde{X}, \tilde{Y}) \), where $\tilde{Y}=\text{argmax}(C^{\dagger}Y)$. The coarsened graph \(\mathcal{G}_c\) is then used to train a Graph Neural Network (GNN), and the trained GNN is evaluated by predicting the labels of the nodes in the original graph whose labels were not initially known.}
    \label{fig:enter-label}
\end{figure*}
Note that the label matrix $Y$ is a one-hot matrix that represents the labels of the nodes of the original graph
$\mathcal{G}$ in a semi-supervised fashion. Each row of $Y$ corresponds to a node, and its one-hot encoding indicates the label of that node.
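
As an illustration of the coarsened labels $\tilde{Y}=\text{argmax}(C^{\dagger}Y)$, consider a hypothetical binary mapping with $p=5$, $k=2$ and $l=2$ in which nodes $1,2$ are mapped to the first supernode and nodes $3,4,5$ to the second, with nodes $1,2,5$ carrying the first label and nodes $3,4$ the second. The values, the fully labelled $Y$, and the row-wise reading of $\text{argmax}$ are assumptions made only for this example. For a binary $C$ with full column rank, $C^{\dagger}=(C^TC)^{-1}C^T$ with $C^TC=\text{diag}(2,3)$, so
\[
C^TY =
\begin{bmatrix}
2 & 0 \\
1 & 2
\end{bmatrix},
\qquad
C^{\dagger}Y =
\begin{bmatrix}
1 & 0 \\
1/3 & 2/3
\end{bmatrix}.
\]
Each row of $C^{\dagger}Y$ holds the label proportions within a supernode, and taking the arg-max of each row assigns the first label to the first supernode and the second label to the second supernode.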

\begin{thm}\label{convergence}
 The sequence $\{ C^{(t)}, \tilde{X}^{(t)}\}$ generated by Algorithm \ref{Algorithm}
 converges to the set of Karush–Kuhn–Tucker (KKT) points of Problem \eqref{Main formulation}.
\end{thm}
{\it Proof:} The proof is deferred to Appendix \ref{prooftheorem}.


