\section{Experiments}
In this section, we empirically verify the effectiveness of our hyperparameter selection method.

We focus on our GCAN architecture and aim to show that our approach selects effective algorithm hyperparameters in this setup. To this end, we compare the performance of GCAN with tuned hyperparameters against GAT and GCN.

For each dataset, we sample 20 random sub-graphs of 100 nodes to learn the optimal hyperparameter $\eta$ via backpropagation. A large disconnected graph is formed by combining these sub-graphs, allowing parameter values to vary across graphs while sharing a unified learnable $\eta$. The optimized hyperparameter is then tested on another 20 test sub-graphs from the same dataset.
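
As a sketch of this construction, in notation introduced here only for illustration, let $A_i$ denote the $100 \times 100$ adjacency matrix of the $i$-th training sub-graph. The combined disconnected graph then has the $2000 \times 2000$ block-diagonal adjacency matrix
\[
A = \mathrm{diag}(A_1, \dots, A_{20}),
\]
so no edge joins two different sub-graphs. Assuming a per-graph loss $\ell_i$ and per-graph parameters $\theta_i$ (both hypothetical names), training can be read as the joint problem
\[
\min_{\eta,\,\theta_1,\dots,\theta_{20}} \; \sum_{i=1}^{20} \ell_i(\theta_i, \eta),
\]
where only $\eta$ is shared across the blocks.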

We also compare our backpropagation-based approach with Bayesian Optimization (see, e.g., \cite{frazier2018tutorialbayesianoptimization}). Using the same 20 training sub-graphs, we perform Bayesian Optimization to select the hyperparameter $\eta$, with both methods given an equal number of forward passes. The selected $\eta$ is then evaluated on a separate set of 20 test sub-graphs from the same dataset.

The results on the test set are shown in \Cref{fig:gcan_multi}. Note that GCN outperforms GAT on some datasets (e.g.\ CORA, CoAuthorCS), while GAT performs better on others (e.g.\ CIFAR10, see also \cite{dwivedi2023benchmarking}). With GCAN, we achieve the best performance on most datasets.
Indeed, as seen in \Cref{fig:gcan_multi}, GCAN consistently achieves accuracy higher than or comparable to that of both GAT and GCN across all datasets. Notably, GCAN shows significant improvements on CIFAR10 and CoAuthorCS. Comparing the two tuning methods, backpropagation achieves better performance on more datasets (e.g.\ CIFAR10, CoAuthorCS, Actor), but Bayesian Optimization is more effective on certain datasets (e.g.\ CORA, AmazonPhotos).

% Additional experiments are located in the Appendix.
In \Cref{appendix:experiments}, we also conduct experiments to empirically verify the results in \Cref{sec:label_prop}. We show that by selecting the number of problem instances $m = O(\log n / \epsilon^2)$, the empirical generalization error is within $O(\epsilon)$, matching our theoretical results. The Appendix also provides further details on the empirical setup and on how the accuracy of GCAN varies with the hyperparameter $\eta$.
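
As a rough illustration of this scaling, suppressing the constant hidden in the $O(\cdot)$ notation and assuming a natural logarithm, taking $n = 100$ and $\epsilon = 0.1$ (values chosen here only for illustration, not taken from our experiments) gives
\[
\frac{\log n}{\epsilon^2} = \frac{\log 100}{0.01} \approx 461,
\]
and halving $\epsilon$ to $0.05$ quadruples this quantity to roughly $1842$.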