\section{GNN Families and  Generalization Guarantees}\label{sec:gcn}
In this section, we study hyperparameter selection for Graph Neural Networks (GNNs)~\citep{kipf2016semi,velivckovic2017graph,iscen2019label}, which excel in tasks involving graph-structured data like social networks, recommendation systems, and citation networks. To understand generalization in hyperparameter selection for GNNs, we analyze Rademacher complexity. 

To the best of our knowledge, we are the first to provide generalization guarantees for hyperparameter selection. Prior work~\citep{garg2020generaliz} focused on Rademacher complexity for graph classification with fixed hyperparameters, whereas we address node classification across multiple instances, optimizing hyperparameters.

In Section~\ref{sec:sgc}, we examine the Rademacher complexity bound of a basic Simplified Graph Convolutional Network~\citep{wu19simplifying} family, as a foundation for the more complex family. 

In Section~\ref{sec:interpolated GCAN}, we introduce a novel architecture, which we call GCAN, that uses a hyperparameter $\eta \in [0,1]$ to interpolate two popular GNNs: the graph convolutional neural networks (GCN) and graph attention neural networks (GAT). GCAN selects the optimal model for specific datasets: $\eta = 0$ corresponds to GCN, $\eta = 1$ to GAT, and intermediate values explore hybrid architectures that may outperform both. We also establish a Rademacher complexity bound for the GCAN family.

Our proofs for SGC and GCAN share a common strategy: modeling the $0$-$1$ loss of each problem instance as an aggregation of single-node losses, reducing the problem to bounding the Rademacher complexity of computation trees for individual nodes. Specifically, we upper bound the $0$-$1$ loss with a margin loss, then relate the complexity of problem instances to the computation trees of nodes. Using a covering argument, we bound the complexity of these trees by analyzing margin loss changes due to parameter variations.\looseness-1

% It remains for us to bound the empirical Rademacher complexity $\hat R_{S}(\mathcal{H}_\rho^\gamma)$. We relate the Rademacher complexity of $\mathcal{H}_\rho^\gamma$ to the complexity of loss of each node. 
For each node $z_i$, we define its computation tree of depth $L$ to represent the structured $L$-hop neighborhood of $v$, where the children of any node $u$ are the neighbors of $u$, $\mathcal{N}_u$. Denote the computation tree of $z_i$ as $t_i$, and the learned parameter as $\theta$, then $l_\gamma(z_i) = l_\gamma(t_i, \theta)$. 
%
We can now rewrite $l_\gamma(Z)$ as an expectation over functions applied to computation trees. Let $t_1, ..., t_t$ be the set of all possible computation trees of depth $L$, and $w_i(Z)$ the number of times $t_i$ occurs in $Z$. Then, we have
\[l_\gamma(Z) = \sum_{i=1}^t \frac{w_i(Z)}{\sum_{j=1}^t w_j(Z)} l_\gamma(t_i, \theta) = \mathbb{E}_{t\sim w^\prime(Z)} l_\gamma(t,\theta).\]

The following proposition indicates that it suffices to bound the Rademacher Complexity of single-node computation trees.\looseness-1 

\begin{proposition}[Proposition 6 from \cite{garg2020generaliz}.]
    Let $S = \{Z_1, ..., Z_m\}$ be a set of i.i.d. graphs, and let $\mathcal{T} = \{t_1, ..., t_m\}$ be such that $t_j \sim w^\prime (Z_j), j \in [m]$. Denote by $\hat R_{S}$ and $\hat R_{\mathcal{T}}$ the empirical Rademacher complexity of $\mathcal{H}_\rho^\gamma$s for graphs $S$ and trees $\mathcal{T}$. Then, $\hat R_{S} = \mathbb{E}_{t_1, ..., t_m} \hat R_{\mathcal{T}}$.
\end{proposition}

\subsection{Simplified Graph Convolutional Network Family}\label{sec:sgc}

Simplified Graph Convolution Network (SGC) is introduced by \citet{wu19simplifying}. By removing nonlinearities and collapsing weight matrices between consecutive layers, SGC reduces the complexity of GCN while maintaining high accuracy in many applications.

Consider input data $X = (n, Z, L, G)$, where the feature is written as a matrix $Z \in \R^{n \times d}$. For any value of the hyperparameter $\beta \in [0,1]$, let $\tilde{W} = W + \beta I$ be the augmented adjacency matrix, $\tilde{D} = D + \beta I$ be the corresponding degree matrix, and $S = \tilde{D}^{-1/2}\tilde{W}\tilde{D}^{-1/2}$ be the normalized adjacency matrix. Let $\theta \in \mathbb{R}^{d}$ be the learned parameter. The SGC classifier of depth $L$ is
\[\hat Y = \text{softmax}(S^L Z \theta).\]

% While \citet{wu19simplifying} introduced this architecture, they did not provide instructions on how to select the best hyperparameter $\beta$.
We focus on learning the algorithm hyperparameter $\beta \in [0,1]$ and define the SGC algorithm family as $\mathcal{F}_{\beta}$. We denote the class of margin losses induced by $\cF_\beta$ as $\mathcal{H}_{\beta}^{\gamma}$.
To study the generalization ability to tune $\beta$, we bound the Rademacher complexity of $\mathcal{H}_{\beta}^{\gamma}$. The proof is detailed in Appendix~\ref{appendix:SGC}.

\begin{theorem}\label{thm:rc of SGC}
    Assuming $D, W,$ and $Z$ are bounded (the assumptions in \cite{bartlett2017spectrally,garg2020generaliz}), i.e. $d_i \in [C_{dl}, C_{dh}] \subset \mathbb{R}^+$, $w_{ij}\in [0,C_w]$, and $\|Z\|\leq C_z$, we have that the 
    Rademacher complexity of $\mathcal{H}^{\gamma}_{\beta}$ is bounded:
    \begin{align*}
    \hat R_{m}(\mathcal{H}^{\gamma}_{\beta})
    = O\left( \frac{\sqrt{dL\log\frac{C_{dh}}{C_{dl}} + d\log{\frac{mC_zC_\theta}{\gamma}}}}{\sqrt{m}}\right).
    % \leq \frac{4}{m} + \frac{12\sqrt{(d+1)\log(16\sqrt{m}\max\{k_2, k_3\})}}{\sqrt{m}},
    \end{align*}
\end{theorem}
This theorem indicates that the number of problem instances needed to learn a near-optimal hyperparameter only scale polynomially with the input feature dimension $d$ and the number of layers $L$ of the neural networks, and only scales logarithmically with the norm bounds $C$'s and the margin $\gamma$. 

\subsection{GCAN Interpolation and  Rademacher Complexity Bounds} \label{sec:interpolated GCAN}
In practice, GCN and GAT outperform each other in different problem instances \citep{dwivedi2023benchmarking}. To effectively choose the better algorithm, we introduce a family of algorithms that \emph{interpolates} GCN and GAT, parameterized by $\eta \in [0,1]$. 
This family includes both GCN and GAT, so by choosing the best algorithm within this family, we can automatically select the better algorithm of the two, specifically for each input data. 
% In practice, running GAT and GCN separately would require new research engineers to evaluate the architectures, whereas using GCAN we could simplify this tedious task by running only a single model. 
% In practice, employing GCAN eliminates the need for new research engineers to separately evaluate each architecture, thereby simplifying what would otherwise be a cumbersome task. 
Moreover, GCAN could potentially outperform both GAT and GCN by taking $\eta$ as values other than $0$ and $1$.
We believe such an interpolation technique could potentially be used to select between other algorithms that share similar architecture. 
% We introduce the GCAN architecture below.
%
% In \Cref{sec:experiments}, we also empirically evaluate the effectiveness of learning $\eta$ parameters in our novel GCAN algorithm family. 

Recall that in both GAT and GCN, the update equation has the form of activation and a summation over the feature of all neighboring vertices in the graph (a brief description of GAT and GCN is given in \Cref{appendix:GAT_GCN}). Thus, we can interpolate between the two update rules by introducing a hyperparameter $\eta \in [0,1]$, where $\eta = 0$ corresponds to GCN and $\eta=1$ corresponds to GAT. Formally, given input $X = (n, \{z_i\}_{i=1}^n, L, G)$, we initialize $h_i^0 = z_i$ and update at a level $\ell$ by
\begin{align*}
    &h_i^{\ell} = \sigma\left(\sum_{j \in \mathcal{N}_i} \left(\eta \cdot e_{ij}^{\ell} + (1 - \eta) \cdot \frac{1}{\sqrt{d_i d_j }} \right)  U^{\ell}  h_j^{\ell} \right),
\end{align*}
{where}
\begin{align*}
    &e_{ij}^{\ell} = \frac{\exp(\hat{e}_{ij}^{\ell})}{\sum_{j' \in \mathcal{N}_i} \exp(\hat{e}_{ij'}^{\ell})}, ~ \hat{e}_{ij}^{\ell} = \sigma(V^{\ell} [U^{\ell}h_i^{\ell}, U^{\ell}h_j^{\ell}]).
\end{align*}
Here $e_{ij}^{\ell}$ is the attention score of node $j$ for node $i$. $V^{\ell}$ and $U^{\ell}$ are learnable parameters. $\sigma(\cdot)$ is a $1$-Lipschitz activation function (e.g. ReLU, sigmoid, etc.).
$[U^{\ell}h_i^{\ell}, U^{\ell}h_j^{\ell}]$ is the concatenation of $U^{\ell}h_i^{\ell}$ and $U^{\ell}h_j^{\ell}$. 
%
We denote this algorithm family by $\mathcal{F}_\eta$ and the induced margin loss class by $\mathcal{H}_\eta^\gamma$. 

% Each value of $\eta$ induces an architecture, thereby generating a parameterized family of algorithms, denoted as $\mathcal{F}_{\eta}$. Consequently, this induces a class of margin losses $\mathcal{H}_{\eta}^\gamma$.

While our primary focus is not the comparative performance of GCAN against GAT or GCN, our curiosity led us to conduct additional experiments, presented in \Cref{appendix:experiments}. The results consistently show that GCAN matches or exceeds the performance of both GAT and GCN.

\begin{theorem}\label{thm:rc of GCAN}
    Assume the parameter $U^\ell$ is shared over all layers, i.e. $U^\ell = U$ for all $\ell \in [L]$ (the assumption used in~\cite{garg2020generaliz}), and the parameters are bounded: $\|U\|_F \leq C_U$, $\|V^\ell\|_2 \leq C_V$, $\|z_i\|\leq C_z$, and $d_i \in [C_{dl}, C_{dh}]$. Denoting the branching factor by $r = \max_{i \in [n]} |\sum_{j \in [n]}\ind[w_{ij} \neq 0]|$, we have that 
    the Rademacher complexity of $\mathcal{H}^\gamma_{\eta}$ is bounded:
    \begin{align*}
        \hat R_{m}(\mathcal{H}^\gamma_{\eta})
        = O\left( \frac{d\sqrt{L\log \frac{rC_U}{C_{dl} + C_U} + \log\frac{mdC_z}{\gamma}}}{\sqrt{m}} \right).
    \end{align*}
\end{theorem}
The proof of Theorem~\ref{thm:rc of GCAN} is similar to that of Theorem~\ref{thm:rc of SGC}. See Appendix~\ref{appendix:GCAN} for details.
%%% putting it here to resolve formatting issues
\begin{figure*}[ht]
    \centering
    \includegraphics[width=0.99\linewidth]{experiments/GCAN_bo.png}
    \caption{Validation Accuracy (computed on the unlabeled nodes across 20 testing graphs) vs. iterations. GCAN competes with the better accuracy between GAT and GCN across datasets. }
    \label{fig:gcan_multi}
\end{figure*}
%%%
\begin{remark}
    The main difference between the Rademacher Complexity of Simplified Graph Convolution Network (\Cref{thm:rc of SGC}) and GCAN (\Cref{thm:rc of GCAN}) is the dependency on feature dimension $d$: $\sqrt{d}$ for SGC and $d$ for GCAN. This difference arises from the dimensionality of the parameters. The parameter $\theta$ in SGC has dimension $d$, but the parameter $U$ and $V$ in GCAN have dimension $d \times d$ and $1 \times 2d$, respectively. As GCAN is a richer model, it requires more samples to learn, but this is not a drawback; its complexity allows it to outperform SGC in many scenarios. 
\end{remark}
\begin{remark}
    There are no direct dependencies on $n$ in \Cref{thm:rc of SGC} and \Cref{thm:rc of GCAN}, but the dependency is implicitly captured by the more fine-grained value $C_{dl}, C_{dh}$, and $C_Z$. Here, $C_{dl}$ and $C_{dh}$ are the lower and upper bounds of the degree (number of neighbors) of the nodes, which generally increase with $n$.
    $C_Z$ is the Frobenius norm of the feature matrix $Z \in \mathbb{R}^{n \times d}$. Since the size of $Z$ scales with $n$, the value of $C_Z$ is generally larger for larger $n$.
\end{remark}