\section{Preliminaries}

\paragraph{Notations.}
Throughout this paper, $f(n) = O(g(n))$ denotes that there exists a constant $c > 0$ such that $|f(n)| \leq c|g(n)|$. $f(n) = \Omega(g(n))$ denotes that there exists a constant $c > 0$ such that $|f(n)| \geq c|g(n)|$. The indicator function is indicated by $\mathbb{I}$, taking values in $\{0, 1\}$.
In addition, we define the shorthand $[c] = \{1, 2, \ldots, c\}$. For a matrix $W$, we denote its Frobenius norm by $\|W\|_F$ and spectral norm by $\|W\|$. We also denote the Euclidean norm of a vector $v$ by $\|v\|$.\looseness-1 

\paragraph{Graph-based Semi-supervised Learning.}
We are given $n$ data points, where some are labeled, denoted by $L \subseteq [n]$, and the rest are unlabeled. We may also have features associated with each data point, denoted by $z_i \in \mathbb{R}^d$ for $i \in [n]$. We can construct a graph $G$ by placing (possibly weighed) edges $w(u,v)$ between pairs of data points $u,v$. The created graph $G$ is denoted by $G = (V, E)$, where $V$ represents the vertices and $E$ represents the edges. Based on $G$, we can calculate $W \in \mathbb{R}^{n \times n}$ as the adjacency matrix, i.e., $W_{ij} = w(i,j)$.
We let $D \in \mathbb{R}^{n \times n}$ be the corresponding degree matrix, so $D = \text{diag}(d_1, \ldots, d_n)$ where $d_i = \sum_{j \in [n]} w(i,j)$.

For a problem instance of $n$ data points, we define input $X$ as $X=(n, \{z_i\}_{i=1}^n, L, G)$, or $X=(n, L, G)$ if no features are available. We denote the label matrix by $Y \in \{0,1\}^{n \times c}$ where $c$ is the number of classes. Throughout the paper, we assume $c = O_n(1)$, i.e. $c$ is treated as a constant with respect to $n$, which matches most practical scenarios.
Here, $Y_{ij} = 1$ if data point $i \in L$ has label $j \in [c]$ and $Y_{ij} = 0$ otherwise. The goal is to predict the labels of the unlabeled data points. 

An algorithm $F$ in this setting may be considered as a function that takes in $(X, Y)$ and outputs a predictor $f$ that predicts a label in $[c]$ for each data.
We denote $f(z_i)$ as our prediction on the $i$-th data.
To evaluate the performance of a predictor $f$, we use $0$-$1$ loss (i.e. the predictive inaccuracy) defined as $\frac{1}{n}\sum_{i=1}^n \ell_{0-1}\left(f(z_i),y_i\right) = \frac{1}{n}\sum_{i=1}^n \mathbb{I}[f(z_i) \neq y_i].$ In this work, we are interested in the generalizability of an algorithm $F$ on $0$-$1$ loss. 

\paragraph{Hyperparameter Selection.}
We consider several \emph{parameterized families} of classification algorithms. Given a family of algorithms $\mathcal{F}_{\rho}$ parameterized by some parameter $\rho$, and a set of $m$ problem instances $\{(X^{(k)}, Y^{(k)})\}_{k=1}^m$ i.i.d. generated from the data  distribution $\mathcal{D}$ of the input space $\mathcal{X}$ and the label space $\mathcal{Y}$, our goal is to
select a parameter $\hat{\rho}$ whose corresponding prediction function $f_{\hat \rho}$ of algorithm $F_{\hat \rho}$ minimizes the prediction error. That is, denote $f_{\hat \rho}(z_i^{(k)})$ as the predicted label of data point $z_i^{(k)}$ in the $k$-th problem instance, we want
\begin{align*}
\hat{\rho} = \text{argmin}_{\rho} \frac{1}{mn}\sum_{k=1}^m \sum_{i=1}^n \ell_{\mathrm{0-1}}(f_\rho(z_i^{(k)}), y_i^{(k)}).
\end{align*}\label{eqn:algo selection obj}
%
Each parameter value $\rho$ defines an algorithm $F_\rho$, mapping a problem instance $(X, Y)$ to a prediction function $f_\rho$, which induces a loss $\frac{1}{n}\sum_{i=1}^n \ell_{0-1}(f_\rho(z_i), y_i)$. We define $H_\rho$ as the function mapping $(X, Y)$ to this loss and $\mathcal{H}\rho = {H{\rho'}}_{\rho'}$ as the family of loss functions parameterized by $\rho$.

Note that our problem setting differs from prior theoretical works on graph-based semi-supervised learning. 
The classical setting considers a single algorithm and learning the model parameter from a single problem instance. We are considering \emph{families of algorithms}, each parameterized by a single hyperparameter, and aiming to learn the best \textit{hyperparameter} across \textit{multiple problem instances}.
Our setting combines transductive and inductive aspects: each instance has a fixed graph of size $n$ (transductive), but the graphs themselves are drawn from an unknown meta-distribution (inductive).\looseness-1


\paragraph{Complexity Measures and Generalization Bounds.}
% In practice, one commonly used technique for hyperparameter tuning is cross-validation, where at each iteration we train a small subset of the validation dataset and use the rest to test its performance. If we define the distribution of problem instances as a uniform distribution on these small subsets of validation data, then our generalization results imply the necessary number of iterations needed to effectively tune hyperparameters using cross validation. 


We study the generalization ability of several representative parameterized families of algorithms. That is, we aim to address the question of how many problem instances are required to learn a hyperparameter $\rho$ such that a learning algorithm can perform near-optimally for instances drawn from a fixed problem distribution. Clearly, the more complex the algorithm family, the more number of problem instances are needed.\looseness-1

Specifically, for each algorithm $f_{\hat{\rho}}$ trained given $m$ problem instances, we study the difference in the empirical $0$-$1$ loss and the actual $0$-$1$ on the distribution:
\begin{align*}
\frac{1}{mn}\sum_{k=1}^m &\sum_{i=1}^n \ell_{\mathrm{0-1}}(f_\rho(z_i^{(k)}), y_i^{(k)})\\
- &\mathbb{E}_{\left(X,Y\right) \sim \mathcal{D}}\left[\frac{1}{n}\sum_{i=1}^n \ell_{0-1}\left(f_{\hat \rho}(z_i),y_i\right)\right].
\end{align*}
\label{eqn:algo sec gen}


To quantify this, we consider two %learning-theoretic 
complexity measures for characterizing the learnability of algorithm families: the \textit{pseudo-dimension} and the \textit{Rademacher complexity}.\looseness-1

\begin{definition}[Pseudo-dimension]
    Let $\mathcal{H}$ be a set of real-valued functions from input space $\mathcal{X}$. We say that $C = (X^{(1)}, ..., X^{(m)}) \in \mathcal{X}^m$ is \textbf{pseudo-shattered} by $\mathcal{H}$ if there exists a vector $r = (r_1, ..., r_m) \in \mathbb{R}^m$ (called ``witness'') such that for all $b = (b_1, ..., b_m) \in \{\pm1\}^m$ there exists $H_b \in \mathcal{H}$ such that $sign(H_b(X^{(k)})-r_k) = b_k$. \textbf{Pseudo-dimension} of $\mathcal{H}$, denoted \textsc{Pdim}($\mathcal{H}$), is the cardinality of the largest set pseudo-shattered by $\mathcal{H}$.
\end{definition}

The following theorem bounds generalization error using pseudo-dimension.

\begin{theorem}\citep{AnthonyBartlett2009}
\label{thm:pdim gen theorem}
    Suppose $\mathcal{H}$ is a class of real-valued functions with range in $[0, 1]$ and finite $\textsc{Pdim}(\mathcal{H})$. Then for any $\epsilon > 0$ and $\delta \in (0,1)$, for any distribution $\mathcal{D}$ and for any set $S = \{X^{(1)}, \ldots, X^{(m)}\}$ of $m = O\big( \frac{1}{\epsilon^2}\left(\textsc{Pdim}(\mathcal{H}) + \log \frac{1}{\delta}\right)\big)$ samples from $\mathcal{D}$, with probability at least $1-\delta$, we have 
    \begin{align*}
        \left|\frac{1}{m}\sum_{k=1}^mH({X^{(k)}}) - \mathbb{E}_{X \sim \mathcal{D}}[H(X)]\right| \le \epsilon & \text{,  for all $H \in \mathcal{H}$}.
    \end{align*}
\end{theorem}

Therefore, if we can show $\textsc{Pdim}(\mathcal{H}_\rho)$ is bounded, then using the standard empirical risk minimization argument, Theorem~\ref{thm:pdim gen theorem} implies using $m = O\left(\nicefrac{\textsc{Pdim}(\mathcal{H})}{\epsilon^2}\right)$ problem instances, the expected error on test instances is upper bounded by $\epsilon$. In Section~\ref{sec:label_prop}, we will obtain \emph{optimal} pseudo-dimension bounds for three canonical label-propagation algorithm families.

Another classical complexity measure is the Rademacher complexity:
\begin{definition}[Rademacher Complexity]
    Given a space $\mathcal{X}$ and a distribution $\mathcal{D}$, let $S = \{X^{(1)}, \ldots, X^{(m)}\}$ be a set of examples drawn i.i.d. from $\mathcal{D}$. Let $\mathcal{H}$ be the class of functions $H: \mathcal{X} \rightarrow \mathbb{R}$. The \textbf{(empirical) Rademacher complexity} of $\mathcal{H}$ is 
    \[\hat R_m(\mathcal{H}) = \mathbb{E}_\sigma \left[\sup \left(\frac{1}{m}\sum_{k=1}^m \sigma_k H(X^{(k)})\right)\right],\]
    where each $\sigma_k$ is i.i.d. sampled from $\{-1, 1\}$. 
\end{definition}

The following theorem bounds generalization error using Rademacher Complexity.
\begin{theorem}\citep{MohriRostamizadehTalwalkar2012}
\label{thm:rc gen}
    Suppose $\mathcal{H}$ is a class of real-valued functions with range in $[0, 1]$. Then for any $\delta \in (0,1)$, any distribution $\mathcal{D}$, and any set $S = \{X^{(k)}\}_{k=1}^m$ of $m$ samples from $\mathcal{D}$, with probability at least $1-\delta$, we have\looseness-1
    \begin{align*}
        &\left|\frac{1}{m}\sum_{k=1}^m H({X^{(k)}}) - \mathbb{E}_{X \sim D}[H(X)]\right| \\
        &\qquad=  O\left( \hat R_m(\mathcal{H}) + \sqrt{\frac{1}{m} \log \frac{1}{\delta}}\right), ~~~\text{for all}~~~ H \in \mathcal{H}.
    \end{align*}
\end{theorem}

To bound the Rademacher complexity in our setting, we restrict to binary classification $c=2$ and change the label space to $Y \in \{-1, 1\}^{n}$.
For a predictor $f$, we also overload notation and let $f(z_i) \in [0,1]$ be the output probability of node $z_i$ being classified as $1$. Instead of directly using the $0$-$1$ loss, we upper bound it using margin loss, which is defined as\looseness-1
\[\ell_\gamma(f(z_i),y_i) = \1[a_i > 0] + (1 + a_i/\gamma)\1\left[a_i \in \left[-\gamma, 0\right]\right]\]
where  $a_i = -\tau(f(z_i), y_i)= (1-2f(z_i))y_i$. 
Then, $a_i>0$ if and only if $z_i$ is classified incorrectly.

Now we define $H^{\gamma}_{\rho}(X) = \frac{1}{n}\sum_{i=1}^n \ell_\gamma\left(f_{\rho}(z_i),y_i\right)$ to be the margin loss of the entire graph when using a parameterized algorithm $F_\rho$.
Based on this definition, we have an induced loss function family $\mathcal{H}^{\gamma}_\rho$.
Then, given $m$ instances, for any $\gamma > 0$, we can obtain an upper bound for all $H_\rho^\gamma \in \mathcal{H}_\rho^\gamma$:
\begin{align*}
\mathbb{E}&_{\left(X,Y\right) \sim \mathcal{D}}\left[\frac{1}{n}\sum_{i=1}^n \ell_{0-1}\left(f_{\hat{\rho}}(z_i),y_i\right)\right] \\
&\le  \; \mathbb{E}_{\left(X,Y\right) \sim \mathcal{D}}\!\left[\frac{1}{n}\!\sum_{i=1}^n \ell_{\gamma}\left(f_{\hat{\rho}}(z_i),y_i\right)\right] \tag{by definition of $\ell_\gamma$} \\
&= \frac{1}{m} \sum_{i=1}^m H^{\gamma}_{\rho}(X^{(k)}) + O\left(\hat R_m(\mathcal{H}_\rho^\gamma) + \sqrt{\frac{\log{(1/\delta)}}{m}}\right). \tag{by Theorem~\ref{thm:rc gen}}
\end{align*}
Therefore, suppose we find a $\hat{\rho}$ whose empirical margin loss ${1}/{m} \sum_{i=1}^m H^{\gamma}_{\hat{\rho}}(X^{(k)})$ is small, and if we can show $\hat R_m(\mathcal{H}_\rho^\gamma)$ is small, then $F_{\hat{\rho}}$ is a strong algorithm for the new problem instances.
In Section~\ref{sec:gcn}, we bound the Rademacher complexity of graph neural network-based algorithm families.

Note that these guarantees we obtained can also be applied to some standard hyperparameter tuning methods like grid search with cross-validation.
For example, in k-fold cross-validation, if we define the distribution of problem instances as a uniform distribution on these small subsets of validation data, then our results imply the necessary number of iterations $k$ needed to effectively tune hyperparameters using cross-validation. That is, the number of folds of cross-validation needed to learn a hyperparameter that performs nearly as well as the hyperparameter if cross-validation were run to convergence.