\section{Label Propagation-based Families and  Generalization Guarantees}
\label{sec:label_prop}

In this section, we consider three parametric families of label propagation-based algorithms, the classical type of algorithms for semi-supervised learning. 
%
Label propagation algorithms output a soft-label score $F^* \in \mathbb{R}^{n \times c}$, where the $(i,j)$-th entry of $F^*$ represents the score of class $j$ for the $i$-th sample. The prediction for the $i$-th sample is the class with the highest score, i.e. $\text{argmax}_{j \in [c]} F^*_{ij}$. 

Below we describe each family that we considered and their corresponding pseudo-dimension bounds.
Notably, the bounds for all three families of algorithms are  $\Theta(\log n)$, which implies the existence of efficient algorithms with robust generalization guarantees in this setting.

\subsection{Algorithm Families}
We consider three parametric families described below. 

\paragraph{Local and Global Consistency Algorithm Family ($\mathcal{F}_\alpha$)} 
The first family considered is the local and global consistent algorithms~\citep{zhou2003learning}, parameterized by $\alpha \in (0,1)$.
The optimal scoring matrix $F^*$
% of $Q(F)$ has a closed-form solution: 
is defined as
\[F^*_\alpha = (1- \alpha)(I - \alpha S)^{-1} Y, ~~\text{where }S = D^{-1/2}WD^{-1/2}. \]
Here, $S$ is the symmetrically normalized adjacency matrix. 
This score matrix $F^*_\alpha$ corresponds to minimizing the following objective function
$
    \mathcal{Q}(F) 
    = \frac{1}{2} ( \sum_{i, j=1}^nW_{ij} \| \frac{1}{\sqrt{d_{i}}}F_i - \frac{1}{\sqrt{d_{j}}}F_j\|^2
    + \frac{1-\alpha}{\alpha}\sum_{i=1}^n \|F_i - Y_i\|^2 ).
$
% where $Y$ is the $n \times c$ matrix whose rows of unlabeled nodes are all zeros. 
The first term of $Q(F)$ measures the local consistency, i.e., the prediction between nearby points should be similar. The second term measures the global consistency, i.e., consistency to its original label. Therefore, the parameter $\alpha \in (0,1)$ induces a trade-off between the local and the global consistency.
We denote this family as $\mathcal{F}_\alpha$, and the 0-1 losses as $\mathcal{H}_\alpha$.

\paragraph{Smoothing-Based Algorithm Family ($\mathcal{F}_\lambda$)}
This second class of algorithm is parameterized by $\lambda \in (0,+\infty)$~\citep{delalleau2005efficient}. Let $\Delta \in \{0,1\}^{n \times n}$ be a diagonal matrix where elements are 1 only if the index is in the labeled set. The scoring matrix $F^*_\lambda$ is
\[ F^*_\lambda = (S + \lambda I_n \Delta_{i \in L})^{-1} \lambda Y, ~~\text{where} S = D-W. \]
The idea of $\mathcal{F}_\lambda$ is similar to $\mathcal{F}_\alpha$. $\lambda$ is a smoothing parameter that balances the relative importance of the known labels and the structure of the unlabeled points.
% (See Figure~\ref{fig:lambda_graph} and Appendix C).
% Note that by Lemma~\ref{lem:lambda invertibility}, $(S + \lambda I_n \Delta_{i \in L})$ is invertible as long as every connected component in $G$ has at least one labeled node. 
% This condition is reasonable because $G$ is the only information we can use to relate the labeled and unlabeled nodes, and if there is a connected component with no label, then we have no information to label nodes in this component. Therefore, we will assume $(S + \lambda I_n \Delta_{i \in L})$ to be invertible in our analysis. Here we denote the set of $0$-$1$ loss functions corresponding to $\mathcal{F}_\lambda$ as $\mathcal{H}_\lambda$.

\paragraph{Normalized Adjacency Matrix Based Family ($\mathcal{F}_\delta$)}
Here we consider an algorithm family \citep{avrachenkov2012}.
This class of algorithm is parameterized by $\delta \in [0,1]$. The scoring matrix $F^*_\delta$ is 
\[F^*_\delta = (I-c\cdot S)^{-1} Y,
~~\text{where}~~S = D^{-\delta}WD^{\delta - 1}. \] 
Here, $S$ is the (not symmetrically) normalized adjacency matrix and $c \in \mathbb{R}$ is a constant.

This family of algorithms is motivated by $\mathcal{F}_\alpha$ and the family of spectral operators defined in \cite{donnat2023studying}. We may notice that the score matrix $F^*_\delta$ defined here is very similar to $F^*_\alpha$ in the local and global consistency family $\mathcal{F}_\alpha$ when $\alpha$ is set to a constant $c$, whose default value considered in \cite{zhou2003learning} is $0.99$. Here, instead of focusing on the trade-off between local and global consistency, we study the spatial convolutions $S$. With $\delta = 1$, we have the row-normalized adjacency matrix $S = D^{-1}W$. With $\delta = 0$, we have the column-normalized adjacency matrix $S = WD^{-1}$. Finally, with $\delta = 1/2$, we have the symmetrically normalized adjacency matrix that we used in $\mathcal{F}_\alpha$ and many other default implementations~\citep{donnat2023studying, wu19simplifying}.
We denote the set of $0$-$1$ loss functions corresponding to $\mathcal{F}_\delta$ as $\mathcal{H}_\delta$.\looseness-1 

\subsection{Pseudo-dimension Guarantees}

We study the generalization behavior of the three families through pseudo-dimension. The following theorems indicate that all three families have pseudo-dimension $O(\log n)$, where $n$ is the number of data in each problem instance. This result suggests that, all three families of algorithms require $m = O\left(\nicefrac{\log n}{\epsilon^2}\right)$ problem instances to learn a $\epsilon$-optimal algorithmic parameter. We also complement our upper bounds with matching pseudo-dimension lower bound $\Omega(\log n)$, which indicates that we cannot always learn a near-optimal parameter if the number of problem instances is further reduced.

\begin{theorem} \label{thm:alpha upper bound}
    The pseudo-dimension of the Local and Global Consistency Algorithmic Family, $\cF_\alpha$, is $\textsc{Pdim}(\mathcal{H}_\alpha) = \Theta(\log n)$, where $n$ is the total number of labeled and unlabeled data points.
\end{theorem}

Similar to  Local and Global Consistency, we give a tight $\Theta(\log n)$ bound on the pseudo-dimension of the Smoothing-based family of \citet{delalleau2005efficient}.

\begin{theorem}\label{thm:lambda upper bound}
    The pseudo-dimension of the Smoothing-Based Algorithmic Family, $\cF_\lambda$, is $\textsc{Pdim}(\mathcal{H}_\lambda) = \Theta(\log n)$, where $n$ is the total number of labeled and unlabeled data points.
\end{theorem}

Finally, and perhaps more surprisingly, we give the same $\Theta(\log n)$ bound on the pseudo-dimension of the normalized adjacency-matrix based family.

\begin{theorem}\label{thm:delta upper bound}
    The pseudo-dimension of the Normalized Adjacency Matrix-Based Algorithmic Family, $\cF_\delta$, is $\textsc{Pdim}(\mathcal{H}_\delta) = \Theta(\log n)$, where $n$ is the total number of labeled and unlabeled data points.
\end{theorem}

The proofs of the above three theorems follow a similar template. Here, we give an overview of the proof idea. The full proof is in \Cref{appendix:label prop}.

\paragraph{Upper Bound}
First, we investigate the function structure of each index in $F^*$.
For the function classes $\cF_\alpha$ and $\cF_\lambda$, the following lemma is useful.
\begin{lemma}\label{lem:degree of determinant}
     Let $A, B \in \mathbb{R}^{n \times n}$, and $C(x) = (A + xB)^{-1}$ for some $x \in \mathbb{R}$. Each entry of $C(x)$ is a rational polynomial $P_{ij}(x)/Q(x)$ for $i, j \in [n]$ with each $P_{ij}$ of degree at most $n-1$ and $Q$ of degree at most $n$.
\end{lemma}

This lemma reduces each index in the matrix of form $C(x) = (A+xB)^{-1}$ into a polynomial of parameter $x$ with degree at most $n$. By definition, we can apply this lemma to $F^*_\alpha$ and $F^*_\lambda$ and conclude that each index of these matrices is a degree-$n$ polynomial of variable $\alpha$ and $\lambda$, respectively. 

For the algorithm family $\cF_\delta$, the following lemma is helpful:

\begin{lemma}\label{lem: delta F form}
    Let $S = D^{-x} W D^{x-1} \in \mathbb{R}^{n \times n}$, and $C(x) = (I - c \cdot S)^{-1}$ for some constant $c \in (0,1)$ and variable $x \in [0,1]$. For any $i,j \in [n]$, the $i,j$-the entry of $C(x)$ is an exponential
    $C(x)_{ij} = a_{ij} \exp(b_{ij}x)$
    for some constants $a_{ij}, b_{ij}$.
\end{lemma}

By definition of $F^*_\delta$, this lemma indicates that each index of $F^*_\delta$ is a weighted sum of $n$ exponentials of the hyperparameter $\delta$.\looseness-1 
% Therefore, the difference between distinct indices of $F^*$ is also a polynomial, and we can upper bound the pseudo-dimension by studying the number of roots in such polynomials. The detailed proof is presented below.

% Recall that the scoring matrix $F^* \in \mathbb{R}^{n \times c}$ has the following closed form 
% $F^* = (1-\alpha)(I-\alpha S)^{-1} Y,$
% where $S = D^{-1/2}WD^{-1/2}$ for the degree matrix $D$.
% By Lemma~\ref{lem:degree of determinant}, for a fixed problem instance, each $[F^*]_{ij}$ is a rational polynomial in $\alpha$ of the form $P_{ij}(\alpha) / Q(\alpha)$, where $P_{ij}$ and $Q$ are polynomials of degree $n$ and $n$, respectively. 

For $F^*$ being a prediction matrix of any of the above three algorithmic family, recall that the prediction on each node $i \in [n]$ is $\hat y_i = \text{argmax}_{j \in [c]}([F^*]_{ij})$, so the prediction on a node can change only when $\text{sign}([F^*]_{ij} - [F^*]_{ik})$ changes for some classes $j,k \in [c]$. For the families $\cF_\alpha$ and $\cF_\lambda$, $[F^*]_{ij} - [F^*]_{ik}$ is a rational polynomial $(P_{ij}(\alpha) - P_{ik}(\alpha)) / Q(\alpha)$, where $(P_{ij}(\alpha) - P_{ik}(\alpha))$ and $Q(\alpha)$ are degree of at most $n$ (we can simply replace $\alpha$ with $\lambda$ for $\cF_\lambda$). Therefore, its sign can only change at most $O(n)$ times. For the family $\cF_\delta$, we refer to the following lemma and conclude that the sign of $F^*_{ij}-F^*_{ik}$ can only change at most $O(n)$ times as well.
\begin{lemma} \label{lem:roots of exp sum}
    Let $a_1, \ldots, a_n \in \mathbb{R}$ be not all zero, $b_1, \ldots, b_n \in \mathbb{R}$, and $f(x) = \sum_{i=1}^n a_i e^{b_i x}$. The number of roots of $f$ is at most $n-1$.
\end{lemma}

Therefore, for all three families, the prediction on a single node can change at most $\binom{c}{2}O(n) \in O(nc^2)$ times as the hyperparameter varies.
For $m$ problem instances, each of $n$ nodes, this implies we have at most $O(mn^2c^2)$ distinct values of the loss function. The pseudo-dimension $m$ then satisfies $2^m \leq O(mn^2c^2)$, which implies $\textsc{Pdim}(\mathcal{H}_{\alpha}) = \textsc{Pdim}(\mathcal{H}_{\lambda}) = \textsc{Pdim}(\mathcal{H}_{\delta}) = O(\log n)$. 

\paragraph{Lower Bound}
% The full proof is in Appendix~\ref{appendix:alpha lower bound}.
Our proof relies on a collection of parameter thresholds and well-designed labeling instances that are shattered by the thresholds. Here we present the proof idea of pseudo-dimension lower bound of the family $\cF_\alpha$. The analysis for $\cF_\lambda$ and $\cF_\delta$ depends on a similar construction. 

We first describe a hard instance of $4$ nodes, using binary labels $a$ and $b$. We have two points labeled $a$ (namely $a_1, a_2$), and one point labeled $b$ (namely $b_1$) connected with both $a_1$ and $a_2$ with edge weight $1$. We also have an unlabeled point $u$ connected to $b_1$ with edge weight $x \geq 0$. That is, the affinity matrix and initial labels are $$W = \begin{bmatrix}
    0 & 1 & 1 & x\\
    1 & 0 & 0 & 0\\
    1 & 0 & 0 & 0\\
    x & 0 & 0 & 0
    \end{bmatrix}, Y = \begin{bmatrix}
    1 & 0 \\
    0 & 1 \\
    0 & 1 \\
    0 & 0 
    \end{bmatrix}.$$
With this construction, the prediction on node $u$ changes and only change when $\alpha = \frac{(x+2)^{1/2}}{2}$. For any $\beta \in [0,1]$ and let $x = 4\beta^2 - 2 \geq 0$, then $\hat y_4 = 0$ when $\alpha < \beta$ and $\hat y_4 = 1$  when $\alpha \geq \beta$.

Now we can create a large graph of $n$ nodes, consisting of $n/4$ connected components as described above. We assume $4$ divides $n$ for simplicity. Given a sequence of $\alpha$'s such that $0 < \alpha_0 < 1/\sqrt{2} \leq  \alpha_1 < \alpha_2 < ... < \alpha_{n/4} < 1$, we can create the $i$-th connected component with $x = 4\alpha_i^2 -2$. Now the predicted label of the unlabeled node in the $i$-th connected component is $0$ when $\alpha < \alpha_i$ and $1$ when $\alpha \geq \alpha_i$. By alternatively labeling these unlabeled nodes, the $0$-$1$ loss of this problem instance fluctuates as $\alpha$ increases.
    
Finally, by precisely choosing the subsequences so that the oscillations align with the bit flips in the binary digit sequence, we can construct m instances that satisfy the $2^m$ shattering constraints.
\begin{figure}[ht]
    \centering
    \includegraphics[width=0.5\textwidth]{lp_instance.png}
    \caption{An illustration of the construction of the problem instance in the lower bound proof. 
    % Within each connected component, the grey node (4) is unlabeled, and the red (1) and blue (2,3) nodes are two different classes. Node 1 is connected with nodes 2 and 3 with an edge weight of 1, and node 1 is connected with 4 with an edge weight of $x$.
    }
    \label{fig:instance_graph}
\end{figure}
%
\begin{remark}
    We reiterate the implications of the above three theorems. All three families have pseudo-dimension $\Theta(\log n)$. This indicates that all three families of algorithms require $m = O\left(\nicefrac{\log n}{\epsilon^2}\right)$ problem instances to learn an $\epsilon$-optimal hyperparameter. 
\end{remark}
