\section{A generalised Structure Theorem} \label{sec:general}

Our first main result is an improved and generalised structure theorem. It shows that, for any matrix $M$, as long as $\lambda_{k+1} \gg \lambda_1$, any set of $k$ orthonormal vectors with Rayleigh quotient close to the smallest eigenvalue of $M$ must be close to linear combinations of the bottom $k$ eigenvectors of $M$. Furthermore, the bottom $k$ eigenvectors of $M$ will be close to linear combinations of these vectors. 
\pagebreak

\begin{thm} \label{thm:general}
    Let $M \in \mathbb{C}^{N \times N}$ be Hermitian and positive semidefinite with eigenvalues $0 \le \lambda_1 \le \cdots \le \lambda_N$ and a corresponding orthonormal basis of eigenvectors $f_1,\dots,f_N$. Let $g_1,\dots,g_k \in \mathbb{C}^N$ be orthonormal and let $\gamma_i \coloneq g_i^* M g_i$ $(1 \le i \le k < N)$. Then, if $\lambda_{k+1} > \lambda_1$, there exist
    \(\hat{g}_1, \hdots, \hat{g}_k \in \spn\{g_1, \hdots, g_k\}\),
    such that  \vspace{-0.4cm}
    \[ 
    \sum_{i=1}^k \|f_i - \hat{g}_i\|^2 \leq \frac{\sum_{i=1}^k \gamma_i - k \lambda_1}{\lambda_{k+1} - \lambda_1}.
    \] \vspace{-0.2cm}
\end{thm}

We can recover the structure theorem (Theorem 1 of \citet{macgregor2022tighter}) by choosing $M = \mathcal{L}$, the normalised Laplacian of an undirected graph $\mathcal{G} = (V,E,w)$, and letting $g_1,\dots,g_k$ be the (normalised) indicator vectors of the clusters achieving the minimum $\rho(k)$.


\begin{corollary} \label{cor:structure}
    Let \(\mathcal{G}\) be undirected and connected with normalised Laplacian \(\mathcal{L}\).
    Let \(\{S_i\}_{i=1}^k\) be any optimal $k$-way partition that achieves \(\rho(k)\). For any $1 \le i \le k$, define $\chi_i \in \mathbb{R}^N$ as $\chi_i(u) = 1$ if $u \in S_i$ and $\chi_i(u) = 0$ otherwise. Let \(g_i = \frac{D^{1/2} \chi_i}{\|D^{1/2} \chi_i\|}\). Then, there exist $\hat{g}_{1}, \ldots, \hat{g}_{k} \in \spn\{g_1, \ldots, g_k\}$ such that \vspace{-0.2cm}
    \[
    \sum_{i=1}^{k} \| f_i - \hat{g}_{i} \|^2 \leq \frac{k \rho(k)}{\lambda_{k+1}}.
    \]
\end{corollary}
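
As a sketch of why Corollary~\ref{cor:structure} follows from Theorem~\ref{thm:general}, assume the standard definitions (not restated in this section) of the conductance $\Phi(S) = w(S, V \setminus S)/\vol(S)$ and of $\rho(k)$ as the minimum, over $k$-way partitions, of $\max_i \Phi(S_i)$. The symbol $\Phi$, the notation $w(S, V \setminus S)$ for the total weight of edges leaving $S$, and the adjacency matrix $A$ are introduced here only for illustration. Writing $\mathcal{L} = I - D^{-1/2} A D^{-1/2}$ and using $\|D^{1/2}\chi_i\|^2 = \vol(S_i)$, each Rayleigh quotient is
\[
\gamma_i = g_i^* \mathcal{L} g_i = \frac{\chi_i^{*} (D - A) \chi_i}{\vol(S_i)} = \frac{w(S_i, V \setminus S_i)}{\vol(S_i)} = \Phi(S_i) \le \rho(k).
\]
The vectors $g_1,\dots,g_k$ are orthonormal because the sets $S_i$ are disjoint. Moreover, $\lambda_1 = 0$ for $\mathcal{L}$, and $\lambda_{k+1} \ge \lambda_2 > 0$ because $\mathcal{G}$ is connected. The right-hand side of Theorem~\ref{thm:general} therefore becomes
\[
\frac{\sum_{i=1}^k \gamma_i - k \lambda_1}{\lambda_{k+1} - \lambda_1} = \frac{\sum_{i=1}^k \gamma_i}{\lambda_{k+1}} \le \frac{k \rho(k)}{\lambda_{k+1}}.
\]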

Informally, Theorem~\ref{thm:general} states that if we choose a matrix representation $M$ of a graph and let $g_1,\dots,g_k$ be indicator vectors of some clusters, then the bottom $k$ eigenvectors of $M$ must be close to linear combinations of these indicator vectors as long as the Rayleigh quotients of the indicator vectors are significantly smaller than $\lambda_{k+1}$. This is crucial to show that Spectral Clustering works well, as evidenced by the next lemma, which is adapted from \cite{cucuringu2020hermitian}.

\begin{lem}
\label{lem:kmeans}
Assume the partition $A_1,\dots,A_k$ output by Algorithm~\ref{alg:spectral} is computed by a $(1+\alpha)$-approximation algorithm for $k$-means, with $\tilde{k} = k$. Let $\{S_1, \hdots, S_k\}$ be any $k$-way partition of $V$ achieving $\rho(k)$. Let $G \in \mathbb{C}^{|V| \times \tilde{k}}$ be such that $(D^{-1/2}G)_{u,\colon} = (D^{-1/2}G)_{v,\colon}$ if $u,v\in S_i$ for some $i$. Let $F,\tilde{F}$ be defined as in Algorithm~\ref{alg:spectral}.
For any $i=1,\dots,k$, let $\mu_i = \vol(S_i)^{-1} \cdot \sum_{u \in S_i} d(u) \tilde{F}_{u,\colon}$. Define $\mathcal{D} \coloneq \min_{i,j} \|\mu_i - \mu_j\|$ and $U \coloneq \sum_{u \in V}\|F_{u,\colon}- G_{u,\colon}\|^2$. Assume $U \le (1/5) \mathcal{D}^{-1} (2+\alpha)^{-1} \cdot \min_{i=1,\dots,k} \vol(S_i)$.
Then, the volume of the symmetric difference between $\{S_1,\dots,S_k\}$ and $\{A_1,\dots,A_k\}$ (up to a permutation of the indices) is at most $\mathcal{O}(\frac{(1+\alpha) U}{\mathcal{D}^2})$.
\end{lem}

Notice that, by choosing $ G_{:,i} = \hat{g}_i$ for $i=1,\dots,\tilde{k}$, we have $U = \sum_{i=1}^{\tilde{k}} \|f_i - \hat{g}_i\|^2$. The quantity $\mathcal{D}$, instead, depends on the choice of the indicator vectors. In the case of undirected graph clustering with $M = \mathcal{L}$ and the traditional choice of indicator vectors, as long as $U \ll 1$ (e.g., by Corollary~\ref{cor:structure}, whenever $k \rho(k) \ll \lambda_{k+1}$), we have that $\mathcal{D} = \Omega\left(\min_{i=1,\dots,k} \vol(S_i)^{-1}\right)$ \citep{macgregor2022tighter}. Assuming a constant-factor approximation algorithm for $k$-means is used, Lemma~\ref{lem:kmeans} guarantees that the symmetric difference between the partition output by spectral clustering and the optimal partition is small.


Theorem~\ref{thm:general} together with Lemma~\ref{lem:kmeans} gives us a computable way to certify the quality of the partition $A_1,\dots,A_k$ output by spectral clustering compared to the optimal partition $S_1,\dots,S_k$. %(i.e., the one achieving $\rho(k)$)
More precisely, let $g_1,\dots,g_k$ be a set of orthonormal indicator vectors for $A_1,\dots,A_k$ with Rayleigh quotients $\gamma_i \coloneq g_i^* M g_i$. Compute the value
$
    \frac{1}{k} \frac{\sum_{i=1}^k \gamma_i - k \lambda_1}{\lambda_{k+1} - \lambda_1}.
$
Theorem~\ref{thm:general} and Lemma~\ref{lem:kmeans} ensure that, if this value is small, then $A_1,\dots,A_k$ are close to $S_1,\dots,S_k$ (up to permutation of the indices). 

Besides recovering the original structure theorem, Theorem~\ref{thm:general} is not confined to the normalised Laplacian of undirected graphs. In Section~\ref{sec:digraphs} we will show how to apply Theorem~\ref{thm:general} to Hermitian representations of digraphs. In particular, we do not necessarily need each $g_i$ to be a classical indicator vector: in certain cases (as in digraph clustering) it might be beneficial to choose a different set of ``indicator vectors''. Moreover, in digraphs, $\lambda_1$ might not necessarily be zero: subtracting $\lambda_1$ from both numerator and denominator on the RHS of our bounds can improve them substantially.

\begin{figure}[ht]
    \begin{subfigure}{0.43\textwidth}
    \setlength{\abovecaptionskip}{8pt}   % space above caption
    \setlength{\belowcaptionskip}{0pt}
        \includegraphics[width=\textwidth]{Figures/4GaussianClustersGraphThreshold4.png}
        \caption{Geometric graph}
        \label{fig:4gaussianclustersgraph}
    \end{subfigure}
    \hfill
    \begin{subfigure}{0.48\textwidth}
    \setlength{\abovecaptionskip}{8pt}   % space above caption
    \setlength{\belowcaptionskip}{0pt}
        \includegraphics[width=\textwidth]{Figures/4GaussianClusters8Eigenvalues.png}
        \caption{Smallest eigenvalues of the normalised Laplacian}
        \label{fig:4gaussianclusterseigenvalues}
    \end{subfigure}
    
   
    \caption{Four clusters generated by sampling points from a mixture of 4 Gaussians and the corresponding geometric graph (\ref{fig:4gaussianclustersgraph}). Notice how the smallest four eigenvalues come in pairs (\ref{fig:4gaussianclusterseigenvalues}).
    \label{fig:4GaussianClustersFigure}}
\end{figure} 
 
The main limitation of Theorem~\ref{thm:general} and Corollary~\ref{cor:structure} is that they both rely on the gap \(\gamma_i - \lambda_1\) being significantly smaller than \(\lambda_{k+1} - \lambda_1\). This, in practice, is not always satisfied. For example, when using \(M = \mathcal{L}\), we often see multiple gaps in the eigenvalues (which might result in some $\gamma_i$ being considerably larger than $\lambda_1=0$), with Spectral Clustering still working effectively (see, e.g., Figure~\ref{fig:4GaussianClustersFigure}).
To overcome this obstacle, we can recursively apply the following result to 
show that, as long as these gaps in the spectrum are large enough, Spectral Clustering can still be effective.

\begin{thm}[Recursive Structure Theorem] \label{thm:rec}
Let  \(q < k\) and $\hat{g}_1, \hdots, \hat{g}_q \in \spn\{{g}_1, \hdots, {g}_q\}$. % and $\hat{f}_1, \hdots, \hat{f}_q \in \spn\{{f}_1, \hdots, {f}_q\}$. 
Then, there exist 
\(\hat{g}_{q+1}, \hdots, \hat{g}_k \in \spn\{{g}_1, \hdots, {g}_k\}\)
and \(\hat{f}_{q+1}, \hdots, \hat{f}_k \in \spn\{{f}_{1}, \hdots, {f}_k\}\) such that \vspace{-0.2cm}
\[
\sum_{i=q+1}^{k} \|f_i - \hat{g}_i\|^2
 \leq \frac{\sum \limits_{i = q + 1}^{k} \left( \gamma_i  -  \lambda_{q + 1} \right)  + \lambda_{k + 1}  \sum \limits_{i=1}^{q} \|f_i - \hat{g}_i\|^2}{\lambda_{k + 1} - \lambda_{q + 1}} 
\]
The same bound also applies to \(\sum_{i=q+1}^{k} \|\hat{f}_i - g_i\|^2\).
\end{thm}

The idea behind Theorem~\ref{thm:rec} is that, if the indicator vectors \(g_{q+1}, \hdots, g_k\) can be expressed \emph{mostly} by the eigenvectors \(f_{q+1}, \hdots, f_k\), then, by the orthonormality of both sets of vectors, the indicator vectors \(g_1, \hdots, g_q\) can be expressed \emph{mostly} by \(f_1, \hdots, f_q\). This results in the error term \(\lambda_{k+1} \sum_{i=1}^q \|f_i - \hat{g}_i\|^2\) being small and, in exchange, we obtain an improved ratio compared with that of Theorem 1, provided \(\gamma_i < \lambda_{k+1}\) for \(i = q+1, \hdots, k\).

Theorem \ref{thm:rec} is most appropriate when there are multiple gaps in the eigenvalues of $M$: we split the spectrum into groups of eigenvalues of similar magnitude, with a large gap between each group; we then recursively reapply Theorem~\ref{thm:rec} for each group $\lambda_{q+1},\dots, \lambda_k$ as long as  \(\gamma_i - \lambda_{q+1}  \ll \lambda_{k+1} - \lambda_{q+1}\).



\begin{sproof}
The optimal choice of \(\hat{g}_i\) for \(i = q+1, \hdots, k\) is the projection of the indicator vectors onto the first \(k\) eigenvectors. Extending the normalised indicator vectors to an orthonormal basis forming an orthogonal matrix \(G\), we obtain an orthogonal matrix \(Q\) where \(F = GQ^{*}\). The quantity \(\sum_{i=1}^{k}\|f_i - \hat{g}_i\|^2\) can be rewritten using submatrices of \(F, G\) and \(Q\) as shown in Figure~\ref{fig:matrix_approx}.
\newpage
\begin{figure}[h!]
    \centering
    \includesvg[width=\linewidth]{full_matrix_approx_visualisation_concentrated}
    \caption{Illustration of how \(\sum_{i=1}^{k}\|f_i - \hat{g}_i\|^2\) is formed from orthogonal matrices.}
    \label{fig:matrix_approx}
\end{figure}
We rewrite \(\sum_{i=1}^{k}\|f_i - \hat{g}_i\|^2\) as \(\|F_k - G_kQ_k\|^2_F\) where \(F_k\) and \(G_k\) are the first \(k\) eigenvectors and indicator vectors respectively and \(Q_k\) is the \(k \times k\) top left block of the diagonal of \(Q\). The proof of this theorem utilises this fact and breaks it down further, considering the case when \(Q_k\) has blocks of concentration, see Figure~\ref{fig:concentration_blocks}:
\begin{figure}[h!]
    \centering
    \includesvg[width=0.2\linewidth]{Q_k_annotated}
    \caption{Illustration of how \(Q_k\) might have blocks with higher values.}
    \label{fig:concentration_blocks}
\end{figure}
\newline
The blocks correspond to groups of the indicator vectors that can express, with high accuracy, groups of eigenvectors.
The rows/columns of \(Q\) have unit length so \(\sum_{i=q+1}^{k} \|f_i - \hat{g}_i\|^2\) can be bounded above by \((k-q) - \sum_{i=q+1}^k \sum_{j=q+1}^k |Q_{ij}|^2\). The sum in this expression can then be bounded below as follows where we have again exploited the unit length rows of \(Q\):

\[
\gamma_i = g_i^*Mg_i = \sum_{j=1}^N \lambda_{j} |Q_{ij}|^2 \geq (\lambda_{q + 1} - \lambda_{k + 1}) \sum_{j = q + 1}^{k} |Q_{ij}|^2 + \lambda_{k+1} \left(1 - \sum_{j = 1}^q |Q_{ij}|^2 \right)
\]
 

Rearranging for \(\sum_{j = q + 1}^{k} |Q_{ij}|^2\) and summing over \(i = q+1, \hdots, k\) provides us with a lower bound on \(\sum_{i=q+1}^k \sum_{j=q+1}^k |Q_{ij}|^2\). Consequently, this gives an upper bound on \((k-q) - \sum_{i=q+1}^k \sum_{j=q+1}^k |Q_{ij}|^2\) as required.
\end{sproof}
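
The rearrangement in the last step of the sketch can be written out as follows. The abbreviations $a_i$ and $b_i$ are introduced only for this illustration, and we follow the indexing of the displayed inequality, in which $|Q_{ij}|^2$ is the squared overlap between $g_i$ and $f_j$; we also assume $\lambda_{k+1} > \lambda_{q+1}$. Writing $a_i \coloneq \sum_{j=q+1}^{k} |Q_{ij}|^2$ and $b_i \coloneq \sum_{j=1}^{q} |Q_{ij}|^2$, the inequality reads $\gamma_i \ge (\lambda_{q+1} - \lambda_{k+1})\, a_i + \lambda_{k+1} (1 - b_i)$, so that
\[
a_i \ge \frac{\lambda_{k+1}(1 - b_i) - \gamma_i}{\lambda_{k+1} - \lambda_{q+1}}
\quad \text{and hence} \quad
1 - a_i \le \frac{(\gamma_i - \lambda_{q+1}) + \lambda_{k+1} b_i}{\lambda_{k+1} - \lambda_{q+1}}.
\]
Summing over $i = q+1, \hdots, k$ gives
\[
(k-q) - \sum_{i=q+1}^k \sum_{j=q+1}^k |Q_{ij}|^2 \le \frac{\sum_{i=q+1}^{k} (\gamma_i - \lambda_{q+1}) + \lambda_{k+1} \sum_{i=q+1}^{k} b_i}{\lambda_{k+1} - \lambda_{q+1}}.
\]
Finally, since the columns of $Q$ also have unit length,
\[
\sum_{i=q+1}^{k} b_i = \sum_{j=1}^{q} \sum_{i=q+1}^{k} |Q_{ij}|^2 \le \sum_{j=1}^{q} \Bigl( 1 - \sum_{i=1}^{q} |Q_{ij}|^2 \Bigr) \le \sum_{j=1}^{q} \|f_j - \hat{g}_j\|^2,
\]
where the last inequality holds because $1 - \sum_{i=1}^{q} |Q_{ij}|^2$ is the squared distance from $f_j$ to $\spn\{g_1, \hdots, g_q\}$, a subspace that contains $\hat{g}_j$. Combining the last two displays yields the right-hand side of Theorem~\ref{thm:rec}.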
  

Typically, the choice for the indicator vectors \(g_1, \hdots, g_k\) is \(g_i =\frac{D^{1/2} \chi_i}{\|D^{1/2} \chi_i\|}\) where $\chi_i(u) = 1$ if $u \in S_i$ and $\chi_i(u) = 0$ otherwise.
The only real restrictions on our choice of indicator vectors, however, are that they are orthonormal and that \(\{D^{-1/2}g_i\}_{i=1}^k\) are constant on the clusters. Since the first eigenvector \(f_1\) of \(\mathcal{L}\) (or \(L\)) is the uninformative vector \(f_1 =\frac{D^{1/2} 1}{\|D^{1/2} 1\|}\) (or \(f_1 = \frac{1}{\|1\|}\)), we can choose \(g_1 = f_1\) and then re-orthogonalise \(g_2, \hdots, g_k\). This allows us to prove the following corollary.


\begin{corollary}\label{cor:RemoveFirstEvec}
    Let $\mathcal{G}$ be an undirected graph and $M = \mathcal{L}$. Let $g_1,\dots,g_k \in \mathbb{R}^N$ be orthonormal such that $g_1 = f_1$. Let \(\gamma_i = g_i^*Mg_i\)  \((i = 1, \hdots, k)\). Suppose $\lambda_{k+1} > \lambda_2$. Then, for \(i=1, \hdots, k\), there exists \(\hat{g}_i \in \spn\{g_1, \hdots, g_k\}\) such that \vspace{-0.2cm}
    \[
    \sum_{i=1}^k \|f_i - \hat{g}_i\|^2 \leq \frac{\sum_{i=2}^k (\gamma_i - \lambda_2)}{\lambda_{k+1} - \lambda_2}.
    \]
\end{corollary}
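
One way to see how this bound arises, as a sketch under the assumption that Theorem~\ref{thm:rec} is applied with $q = 1$ and $\hat{g}_1 = g_1 = f_1$: the term $\|f_1 - \hat{g}_1\|^2$ vanishes, so
\[
\sum_{i=1}^{k} \|f_i - \hat{g}_i\|^2 = \sum_{i=2}^{k} \|f_i - \hat{g}_i\|^2 \le \frac{\sum_{i=2}^{k} (\gamma_i - \lambda_2) + \lambda_{k+1} \cdot 0}{\lambda_{k+1} - \lambda_2} = \frac{\sum_{i=2}^{k} (\gamma_i - \lambda_2)}{\lambda_{k+1} - \lambda_2}.
\]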

Corollary~\ref{cor:RemoveFirstEvec} essentially states that if the indicator vectors have Rayleigh quotient close to $\lambda_2 \ll \lambda_{k+1}$, then the bottom $k$ eigenvectors of $M$ must be very close to linear combinations of the indicator vectors. In contrast with the original structure theorem, this does not require $\lambda_2$ to be small. 

Corollary~\ref{cor:RemoveFirstEvec} can be used to analyse Spectral Clustering on stochastic block models. While such an analysis can be obtained using standard perturbation arguments, stochastic block models are an egregious instance where the structure theorem of Corollary~\ref{cor:structure} fails. We illustrate the performance of Corollary~\ref{cor:RemoveFirstEvec} on SBMs in the appendix.


\subsection{Experimental results}


To illustrate the impact of Theorem~\ref{thm:rec}, we consider synthetic graphs with a \emph{hierarchical} structure and show the improved performance the theorem provides. We then consider some real-world networks to show that this hierarchical structure is naturally occurring and can be exploited to get a better bound on the performance of Spectral Clustering. 

% \subsubsection{Synthetic networks} \label{subsubsec:StructureSynthExamples}
\paragraph{Geometric random graphs}
We sample \(100\) points each from four two-dimensional Gaussians centred, respectively, at \((0,0),(0,5),(d,0),(d,5)\), where \(d\) is a parameter we vary. %The covariance of each distribution is fixed as the identity \(I\) and 
A graph is constructed by assigning an edge between any two points if the Euclidean distance between them is less than a threshold, which we choose in this case to be 4. In Figure~\ref{fig:combined_bounds_models}a we compare our bounds on the distance between indicator vectors of the clusters and the eigenvectors of the Laplacian with the results of \citet{macgregor2022tighter} and the actual distances. Results are averaged over 10 realisations.

\begin{figure}[ht]
    \centering
    \begin{minipage}[t]{0.45\textwidth}
        \centering
        \includegraphics[width=\linewidth]{Figures/BoundsGaussianMixtureModel2PairsDriftingApart.png}
        \caption*{(a)}
    \end{minipage}
    \hfill
    \begin{minipage}[t]{0.45\textwidth}
        \centering
        \includegraphics[width=\linewidth]{Figures/Bounds4Clusters2PairsRST.png}
        \caption*{(b)}
    \end{minipage}

    \caption{(a) Geometric random graph from Gaussian mixture model with varying distance between centres. (b) Stochastic block model with 4 blocks where two pairs have a higher affinity to each other. In both cases, the Error refers to the bounds given by Theorems~\ref{thm:general} and~\ref{thm:rec}, and by \citet{macgregor2022tighter}, on $\frac{1}{k} \sum_{i=1}^{k} \|f_i - \hat{g}_i\|^2$, together with its actual value. Standard deviation is included as filled error bars.}
    \label{fig:combined_bounds_models}
\end{figure}



Notice that, as the distance parameter \(d\) increases, drawing two pairs of clusters further apart from each other, Theorem~\ref{thm:rec} drastically outperforms Corollary~\ref{cor:structure} and Theorem~\ref{thm:general}.

\paragraph{Stochastic block models}
We consider an SBM with 4 equal-sized clusters $S_1,\dots,S_4$. Let $P_{ij}$ be the probability that, for any $u \in S_i, v \in S_j$ there exists an edge between $u$ and $v$. We set $P_{ii} = 0.5 \,(i=1,\dots,4)$ and $P_{12} = P_{21} = P_{34} = P_{43} = 0.4$. All other values are set equal to \(0.1\). We effectively divide the vertices into two pairs of clusters which are more strongly connected to one another. In Figure~\ref{fig:combined_bounds_models}b, we compare our results with \cite{macgregor2022tighter} and the true distances between indicator vectors of the optimal clusters and the eigenvectors of the Laplacian. Notice that Theorem~\ref{thm:rec} greatly outperforms Corollary~\ref{cor:structure} and Theorem~\ref{thm:general}. Further experiments (such as when the parameters of the stochastic block model are close to the detectability threshold) are available in the Appendix.
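
To indicate why this model exhibits several spectral gaps, the following is a back-of-the-envelope calculation on the \emph{expected} graph rather than on a sampled one. It assumes that self-loops are allowed, so that every vertex has the same expected degree; the block size $n$ and the all-ones matrix $J_n$ are introduced only for this illustration. The matrix of edge probabilities
\[
P = \begin{pmatrix} 0.5 & 0.4 & 0.1 & 0.1 \\ 0.4 & 0.5 & 0.1 & 0.1 \\ 0.1 & 0.1 & 0.5 & 0.4 \\ 0.1 & 0.1 & 0.4 & 0.5 \end{pmatrix}
\]
has eigenvalues $1.1$, $0.7$, $0.1$, $0.1$, with eigenvectors $(1,1,1,1)$, $(1,1,-1,-1)$, $(1,-1,0,0)$ and $(0,0,1,-1)$ respectively. The expected adjacency matrix is $P \otimes J_n$ and every expected degree equals $1.1\,n$, so the normalised Laplacian of the expected graph has eigenvalues
\[
1 - \tfrac{1.1}{1.1} = 0, \qquad 1 - \tfrac{0.7}{1.1} = \tfrac{4}{11} \approx 0.364, \qquad 1 - \tfrac{0.1}{1.1} = \tfrac{10}{11} \approx 0.909 \ \text{(twice)}, \qquad 1 \ \text{(with multiplicity } 4n - 4\text{)}.
\]
In this idealised setting $\lambda_2$ is bounded away from zero and is well separated from $\lambda_3 = \lambda_4$, which is the grouped spectrum that Theorem~\ref{thm:rec} is designed to exploit. One would expect the spectrum of a sampled graph to deviate from these values, so they should be read as a rough guide only.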

\paragraph{Real-world networks}
In Table~\ref{tab:realworldgraphs}, we illustrate the improvement Theorem~\ref{thm:rec} provides over the results of \citet{macgregor2022tighter} on a variety of real-world networks. For MNIST \citep{MNIST}, Fashion MNIST \citep{xiao2017fashion} 
%(both accessed via OpenML \citep{OpenML2013}) 
and the Air Quality dataset \cite{AirQuality}, we constructed a graph from the data by computing a correlation matrix from its data points and assigning edges based on whether the correlation exceeded a pre-defined threshold. Twitch \citep{rozemberczki2021twitch}, LastFM \citep{LastFMfeather}, Athletes \citep{facebook-gemsec} and CA-CondMat \citep{CondMatterCollab} are all network datasets available at SNAP \cite{snapnets}. In the table, $N$ (resp. $M$) refers to the number of nodes (resp. edges). The number of clusters $k$ has been chosen so that a relatively large gap between $\lambda_{k}$ and $\lambda_{k+1}$ exists. The improvements given by Theorem~\ref{thm:general} over \cite{macgregor2022tighter} are due to the fact that we sum over the Rayleigh quotients of the $k$ indicator vectors, rather than simply upper-bounding this value by $k \rho(k)$ as done in \cite{macgregor2022tighter}. More interesting is, arguably, the improvement of Theorem~\ref{thm:rec}, which is due to many networks having clustered eigenvalues and/or large $\lambda_2$. Notice how Theorem~\ref{thm:rec} is able to certify that, in these examples, the output of spectral clustering is well-correlated with the partition of minimum conductance, something not achievable with previous structure theorems. For added context, we include \(\lambda_{k+1}\) and an approximation of \(\rho(k)\), which we denote by \(\Tilde{\rho}(k)\). The approximation is obtained using the clusters output by spectral clustering.
\begin{table}[ht]
    \centering
    \caption{Comparison of the bounds on $\frac{1}{k} \sum_{i=1}^{k} \|f_i - \hat{g}_i\|^2$ given by Theorems~\ref{thm:general} and~\ref{thm:rec}, and by~\citet{macgregor2022tighter}, on real-world networks. For MNIST and Fashion MNIST (marked with *), we take a sample.}\label{tab:realworldgraphs}
    \begin{tabular}{lrrrccccc}
        \toprule % from booktabs package
         \bfseries Dataset & \bfseries k & \bfseries N & \bfseries M & \bfseries ~\cite{macgregor2022tighter} & \bfseries Theorem~\ref{thm:general} & \bfseries Theorem~\ref{thm:rec} & $\Tilde{\rho}(k)$ & $\lambda_{k+1}$ \\
        \midrule % from booktabs package
        MNIST*        & 5   & 2348    & 140018   & 0.4395  & 0.1915  & 0.1744  & 0.021 & 0.048\\
        Fash. MNIST*  & 6   & 2872    & 1711206  & 0.5614  & 0.3761  & 0.3225  & 0.200 & 0.35\\
        Air Quality           & 3   & 4942    & 2784780  & 0.6234  & 0.2883  & 0.2350  & 0.332 & 0.533\\
        Twitch                & 2   & 168114  & 6797557  & 0.9567  & 0.4793  & 0.3937  & 0.128 & 0.133\\
        LastFM                & 2   & 7622    & 27800    & 0.8373  & 0.4221  & 0.3123  & 0.013 & 0.016\\
        Athletes              & 3   & 13866   & 86852    & 0.6951  & 0.4033  & 0.2427  & 0.053 & 0.037\\
        CA-CondMat            & 3   & 21363   & 91342    & 0.7806  & 0.4818  & 0.3370  & 0.012 & 0.016\\
        \bottomrule % from booktabs package
    \end{tabular}
\end{table}

%         MNIST (sample)        & 5   & 2348    & 140018  & 0.021 & 0.048\\
%         Fash. MNIST (sample)  & 6   & 2872    & 1711206 & 0.200 & 0.356\\
%         Air Quality           & 3   & 4942    & 2784780 & 0.332 & 0.533\\
%         Twitch                & 2   & 168114  & 6797557 & 0.162 & 0.133\\
%         LastFM                & 2   & 7622    & 27800   & 0.066 & 0.016\\
%         Athletes              & 3   & 13866   & 86852   & 0.053 & 0.037\\
%         CA-CondMat            & 3   & 21363   & 91342   & 0.014 & 0.016\\




