\section{Preliminaries} \label{sec:background}

We begin by defining the preliminary concepts and notation used throughout this paper. A summary of the key symbols is provided in Table~\ref{tab:notation} for easy reference.

\begin{table}[!ht]
\centering
\caption{Summary of key notation.}
\label{tab:notation}
\begin{tabular}{ll}
\toprule
\textbf{Symbol} & \textbf{Description} \\
\midrule
\multicolumn{2}{l}{\textit{Graphs and Federated Learning}} \\
$\mathcal{G}, \mathcal{V}, \mathcal{E}$ & A graph with a set of nodes and edges. \\
$X, Y$ & Node features and labels. \\
$K$ & Total number of clients. \\
$\mathcal{G}^k, \mathcal{V}^k, \mathcal{E}^k$ & Subgraph at client $k$. \\
$P_k$ & Data distribution at client $k$. \\
$m_k$ & Training set size at client $k$. \\
$n_k$ & Calibration set size at client $k$. \\
\midrule
\multicolumn{2}{l}{\textit{Conformal Prediction}} \\
$\alpha$ & Target miscoverage level. \\
$S(x, y)$ & Non-conformity function. \\
$s_v$ or $s_i^k$ & Non-conformity score for a node. \\
$\hat{q}_\alpha$ & Empirical score quantile (cutoff). \\
$C_\alpha(x)$ & Conformal prediction set. \\
\midrule
\multicolumn{2}{l}{\textit{Generative Model}} \\
$c_m$ & Feature prototype (cluster center). \\
$\hat{X}$ & Aggregated set of all prototype features. \\
$M$ & Number of feature prototypes per client $k$. \\
$p$ & Percentage of new edges to add. \\
\bottomrule
\end{tabular}
\end{table}

\subsection{Conformal Prediction}

Conformal prediction is a framework for uncertainty quantification that provides rigorous statistical guarantees. We focus on the split conformal prediction method \citep{vovk2005algorithmic}, notable for its computational efficiency. The method defines a non-conformity measure \( S: \mathcal{X} \times \mathcal{Y} \rightarrow \mathbb{R} \), which quantifies how atypical a candidate label \( y \) is for the input \( x \) according to the model's predictions. For classification tasks, \( S(x, y) \) can be defined as \( 1 - f_y(x) \), where \( f_y(x) \) is the estimated probability of class \( y \) given \( x \).

\subsubsection{Quantile Calculation and Prediction Set Construction}

Using a calibration dataset \( \mathcal{D}_{\text{calib}} = \{(x_i, y_i)\}_{i=1}^n \), we compute the non-conformity score \( S_i = S(x_i, y_i) \) for each calibration example. The cutoff value $\hat{q}_\alpha$ is then determined as the \( (1 - \alpha)(1 + \frac{1}{n}) \)-th empirical quantile of these scores, i.e., \( \hat{q}_\alpha = \text{quantile}\left( \{S_1, \ldots, S_n\},\, (1 - \alpha)\left(1 + \frac{1}{n}\right) \right) \). Given a new input \( x \), the prediction set is constructed as \( C_\alpha(x) = \{ y \in \mathcal{Y} : S(x, y) \leq \hat{q}_\alpha \} \). Under the assumption that the calibration and test data are exchangeable, this method guarantees that the true label \( y \) is contained in \( C_\alpha(x) \) with probability at least \( 1 - \alpha \), marginally over the calibration and test data.
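
As a small numerical illustration (the values \( n = 9 \) and \( \alpha = 0.2 \) are hypothetical and chosen only for readability), suppose the empirical quantile at level \( \tau \) is taken to be the \( \lceil n\tau \rceil \)-th smallest score, with no interpolation. Under that convention,
\[
(1 - \alpha)\left(1 + \frac{1}{n}\right) = 0.8 \cdot \frac{10}{9} = \frac{8}{9},
\qquad
\left\lceil n \cdot \frac{8}{9} \right\rceil = \lceil (n + 1)(1 - \alpha) \rceil = 8,
\]
so the cutoff is the 8th smallest of the 9 calibration scores. Quantile routines that interpolate between order statistics may return a slightly different value, so an implementation should fix the quantile convention explicitly.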

Adaptive Prediction Sets (APS)~\citep{romano2020classification} construct prediction sets by accumulating class probabilities. Given a probabilistic classifier that outputs estimated class probabilities \( f(x) = (f_1(x), \ldots, f_{|\mathcal{Y}|}(x)) \), where \( f_j(x) \) is the estimated probability of class \( j \) for input \( x \), we sort the classes in descending order of probability to obtain a permutation \( \pi \) such that \( f_{\pi(1)}(x) \geq f_{\pi(2)}(x) \geq \ldots \geq f_{\pi(|\mathcal{Y}|)}(x) \). The cumulative probability up to the \( k \)-th class is \( V(x, k) = \sum_{j=1}^k f_{\pi(j)}(x) \). For each calibration example \( (x_i, y_i) \), we compute the non-conformity score \( S_i = V(x_i, k_i) \), where \( k_i \) is the rank of the true label \( y_i \) in the sorted class probabilities for \( x_i \). The cutoff value \( \hat{q}_\alpha \) is then determined as before. The prediction set \( C_\alpha(x) \) includes the top \( k^* \) classes, where \( k^* = \min \left\{ k : V(x, k) \geq \hat{q}_\alpha \right\} \) and \( C_\alpha(x) = \{ \pi(1), \ldots, \pi(k^*) \} \). While we use APS for presenting our main results, other scores such as Regularized Adaptive Prediction Sets (RAPS) and Least Ambiguous Set-Valued Classifiers (LAC) are also commonly used. We provide a comparative analysis of these non-conformity scores in Appendix~\ref{appendix:nonconformity_comparison}.
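
To make the construction concrete, consider a hypothetical three-class example (all numbers are illustrative and not taken from any dataset) with \( f(x) = (0.2, 0.5, 0.3) \). Sorting gives \( \pi(1) = 2 \), \( \pi(2) = 3 \), \( \pi(3) = 1 \), and
\[
V(x, 1) = 0.5, \qquad V(x, 2) = 0.8, \qquad V(x, 3) = 1.0 .
\]
If \( x \) were a calibration example with true label \( y = 3 \), its rank would be \( 2 \) and its score would be \( V(x, 2) = 0.8 \). If instead \( x \) is a test input and the cutoff is \( \hat{q}_\alpha = 0.7 \), then \( k^* = 2 \) and \( C_\alpha(x) = \{2, 3\} \).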

\subsubsection{Evaluation Metrics}

Our goal is to achieve valid marginal coverage while minimizing the size of the prediction sets. The inefficiency is measured as
\( \text{Inefficiency}_\alpha = \frac{1}{m} \sum_{j=1}^m |C_\alpha(x_j)| \),
on the test set \( \mathcal{D}_{\text{test}} = \{ (x_j, y_j) \}_{j=1}^m \), where $m$ denotes the number of test samples. Empirical coverage is calculated as
\( \text{Coverage}_\alpha = \frac{1}{m} \sum_{j=1}^m \mathbbm{1}\{ y_j \in C_\alpha(x_j) \} \),
representing the proportion of test examples where the true label is included in the prediction set.
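
For instance, with a hypothetical test set of \( m = 4 \) examples whose prediction sets have sizes \( 1, 2, 2, 3 \) and contain the true label in three of the four cases,
\[
\text{Inefficiency}_\alpha = \frac{1 + 2 + 2 + 3}{4} = 2,
\qquad
\text{Coverage}_\alpha = \frac{3}{4} = 0.75 .
\]
With a target of \( 1 - \alpha = 0.9 \), such a result would fall short of the nominal coverage on this sample, although with so few test points the deviation could be due to sampling noise alone.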

\subsection{GNNs and Federated Graph Learning}

GNNs effectively capture structural information and node features in graph-structured data \citep{kipf2017semisupervisedclassificationgraphconvolutional}. Consider a graph \( \mathcal{G} = (\mathcal{V}, \mathcal{E}) \), where \( \mathcal{V} \) is the set of \( n \) nodes and \( \mathcal{E} \) is the set of edges. Each node \( v \in \mathcal{V} \) is associated with a feature vector \( x_v \in \mathbb{R}^d \), forming the input matrix \( X = \{x_v\}_{v \in \mathcal{V}} \in \mathbb{R}^{n \times d} \).

In node classification, the goal is to predict labels for nodes by leveraging both node features and the graph topology. We operate under a transductive learning setting where the full graph \( \mathcal{G} \) is available during training and testing, but test labels are withheld. To enable conformal prediction, we partition the node set \( \mathcal{V} \) into four disjoint subsets: training, validation, calibration, and test nodes, denoted as \( \mathcal{V}_{\text{train}}, \mathcal{V}_{\text{valid}}, \mathcal{V}_{\text{cal}}, \) and \( \mathcal{V}_{\text{test}} \), respectively.

A GNN produces node representations through a sequence of message-passing layers. At each layer \( \ell \), a node \( u \) receives messages from its neighbors \( v \in \mathcal{N}_u \), computed using a learnable function \( \texttt{MSG}(h_u^{(\ell-1)}, h_v^{(\ell-1)}) \), where \( h_u^{(\ell-1)} \) denotes the embedding of node \( u \) from the previous layer. The incoming messages are aggregated via a permutation-invariant function \texttt{AGG}, and the node embedding is updated using a learnable function \texttt{UPD}:

\begin{align*}
h_u^{(\ell)} = \texttt{UPD} \Big(& \texttt{AGG} \big( \{ \texttt{MSG}(h_u^{(\ell-1)}, h_v^{(\ell-1)}) \mid v \in \mathcal{N}_u \} \big), \\
& h_u^{(\ell-1)} \Big).
\end{align*}
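
As one illustrative instantiation (an assumption for exposition, not necessarily the architecture used in this paper), taking \( \texttt{MSG}(h_u, h_v) = h_v \), mean aggregation for \texttt{AGG}, and a single linear layer followed by a nonlinearity \( \sigma \) for \texttt{UPD} gives
\begin{align*}
h_u^{(\ell)} = \sigma \Big(& W_1^{(\ell)} h_u^{(\ell-1)} \\
& + W_2^{(\ell)} \frac{1}{|\mathcal{N}_u|} \sum_{v \in \mathcal{N}_u} h_v^{(\ell-1)} \Big),
\end{align*}
where \( W_1^{(\ell)} \) and \( W_2^{(\ell)} \) are hypothetical learnable weight matrices introduced only for this example.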

The final-layer embeddings are used by a classifier to produce predictions \( \mu(X) \), which are used to compute the supervised loss over the training set \( \mathcal{V}_{\text{train}} \).

Federated Graph Neural Networks extend GNNs to settings where graph data is distributed across multiple clients \citep{liu2024federated}. A central server coordinates with \( K \) clients, each holding a subgraph \( \mathcal{G}^k \subset \mathcal{G} \). Each client independently trains a local GNN model on its subgraph using its local labeled nodes. After local training, model parameters \( \theta_k \) are transmitted to the server, which aggregates them using Federated Averaging (FedAvg) \citep{mcmahan2017communication}:

\[
\theta = \sum_{k=1}^K \frac{m_k}{m} \theta_k,
\]

where \( \theta_k \) denotes the parameters of the local GNN model at client \( k \), \( m_k \) is the number of local training samples, and \( m = \sum_{k=1}^K m_k \). The aggregated global model \( \theta \) is then broadcast back to clients for the next round of training. This process enables collaborative training while preserving data privacy, as no raw data or node features are shared between clients.
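
As a toy illustration with hypothetical numbers, take \( K = 2 \) clients with \( m_1 = 300 \) and \( m_2 = 100 \) training samples, and a single scalar parameter with local values \( \theta_1 = 1.0 \) and \( \theta_2 = 2.0 \). Then \( m = 400 \) and
\[
\theta = \frac{300}{400} \cdot 1.0 + \frac{100}{400} \cdot 2.0 = 1.25,
\]
so the aggregate lies closer to the value from the client with more training data.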

In our framework, the FedAvg aggregation is applied to both the GNNs used for node classification and the Variational Graph Autoencoders (VGAEs) used for generating missing neighbor links (Section~\ref{sec:method2}). The shared model parameters \( \theta \) include all learnable weights in the encoder layers, ensuring consistency across clients without exposing private subgraph structure.

\subsection{Variational Autoencoders} \label{sec:vae}

VAEs \citep{kingma2013auto} and their extension to graph data, VGAEs \citep{kipf2016variational}, are fundamental to our approach for generating node features and predicting edges. Both methods combine deep learning with variational Bayesian inference to learn latent representations by maximizing the evidence lower bound (ELBO), which trades off reconstruction quality against the Kullback-Leibler (KL) divergence between the approximate posterior and the prior. The ELBO is defined as:
$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{\text{KL}}[q_\phi(z|x) \| p(z)]$,
where $q_\phi(z|x)$ is the approximate posterior over the latent variable $z$ (the encoder), $p_\theta(x|z)$ is the likelihood of the data given $z$ (the decoder), and $p(z)$ is the prior. In our model, VAEs generate node features, while VGAEs learn latent representations of graph structures for the edge prediction task.
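
For reference, under a common modeling choice (assumed here for illustration only; the latent dimension \( J \) and the symbols \( \mu_j \) and \( \sigma_j \) are introduced just for this example) of a diagonal Gaussian approximate posterior \( q_\phi(z|x) = \mathcal{N}(\mu, \mathrm{diag}(\sigma^2)) \) and a standard normal prior \( p(z) = \mathcal{N}(0, I) \) over \( z \in \mathbb{R}^J \), the KL term has the closed form
\[
D_{\text{KL}}[q_\phi(z|x) \| p(z)] = \frac{1}{2} \sum_{j=1}^{J} \left( \mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2 \right),
\]
which is zero exactly when \( \mu = 0 \) and \( \sigma_j = 1 \) for all \( j \).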