\onecolumn
\title{Supplementary Material}
\maketitle

\appendix

\section{Partial Exchangeability Proofs}
\label{appendix:1}

\subsection{Proof of Lemma 1}

Consider the unordered graph \(\mathcal{G}^k = (\mathcal{V}^k, \mathcal{E}^k)\) within the permutation-invariant graph learning environment as outlined in the conditions of Lemma 1. Assuming that the graph structure, attribute information, and node label information are fixed, we define the nonconformity scores at nodes in \(\mathcal{V}^k_{\text{calib}} \cup \mathcal{V}^k_{\text{test}}\) as
\[
\{s_v\} = S\left(\mathcal{V}^k, \mathcal{E}^k, \{(x_v, y_v)\}_{v \in \mathcal{V}^k_{\text{train}} \cup \mathcal{V}^k_{\text{valid}}}, \{x_v\}_{v \in \mathcal{V}^k_{\text{calib}} \cup \mathcal{V}^k_{\text{test}}}\right),
\]
where \(S\) denotes the scoring function used to compute the nonconformity scores.

Due to the permutation invariance of the model (Assumption 1), for any permutation \(\pi\) of the nodes in \(\mathcal{V}^k_{\text{calib}} \cup \mathcal{V}^k_{\text{test}}\), the nonconformity scores remain unchanged. Specifically, we have
\[
\{s_v\} = S\left(\pi\left(\mathcal{V}^k\right), \pi\left(\mathcal{E}^k\right), \{(x_v, y_v)\}_{v \in \mathcal{V}^k_{\text{train}} \cup \mathcal{V}^k_{\text{valid}}}, \{x_{\pi(v)}\}_{v \in \mathcal{V}^k_{\text{calib}} \cup \mathcal{V}^k_{\text{test}}}\right).
\]
Here, \(\pi\left(\mathcal{V}^k\right)\) and \(\pi\left(\mathcal{E}^k\right)\) denote the vertex set and edge set permuted according to \(\pi\).

This invariance implies that, regardless of the permutation of nodes in \(\mathcal{V}^k_{\text{calib}} \cup \mathcal{V}^k_{\text{test}}\), the computed nonconformity scores \(\{s_v\}\) remain the same. Therefore, the unordered set of scores \(\{s_v\}_{v \in \mathcal{V}^k_{\text{calib}} \cup \mathcal{V}^k_{\text{test}}}\) is invariant under permutations of the nodes, confirming the lemma's assertion about the stability and invariance of the score set in this setting.

\subsection{Remark on Assumption 2}
\label{appendix:assumption2}

Under Assumption 2, the nonconformity scores \(\{s_{v_i}\}_{v_i \in \mathcal{V}^k_{\text{calib}}}\) for client \(k\) are identically distributed and exchangeable. Extending this set to include the score \(s_{v_{\text{test}}} = S(x_{v_{\text{test}}}, y_{v_{\text{test}}})\), where \((x_{v_{\text{test}}}, y_{v_{\text{test}}}) \sim P_k\) (the distribution for client \(k\)), the augmented set \(\{s_{v_i}\}_{v_i \in \mathcal{V}^k_{\text{calib}}} \cup \{s_{v_{\text{test}}}\}\) remains identically distributed and exchangeable.

This demonstrates that \(s_{v_{\text{test}}}\) is equivalent in distribution to any \(s_{v_i}\) in the calibration set. Therefore, the test score \(s_{v_{\text{test}}}\) can be considered as an additional sample from the same distribution, affirming the IID and exchangeability conditions outlined in Assumption 2.

\subsection{Proof of Theorem 1}
\label{appendix:theorem1}

We aim to show that under the given assumptions, the conformal prediction framework achieves the intended coverage guarantees.

Let \(N = \sum_{k=1}^K n_k\) be the total number of calibration nodes across all clients, where \(n_k\) is the number of calibration nodes for client \(k\). Define \(p_k = \dfrac{n_k + 1}{N + K}\), so that \(\sum_{k=1}^K p_k = 1\).

For each client \(k\), let \(m_k(q)\) denote the number of nonconformity scores less than or equal to \(q\) among the \(n_k + 1\) scores (including the test node), that is,
\[
m_k(q) = \left|\left\{s_v \mid s_v \leq q, \, v \in \mathcal{V}^k_{\text{calib}} \cup \{v_{\text{test}}\}\right\}\right|.
\]


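Recall that the conformal quantile \(\hat{q}_\alpha\) is defined as the \(\lceil (1 - \alpha)(N + K) \rceil\)-th smallest nonconformity score among all calibration and test scores from all clients. Assuming the scores are almost surely distinct (no ties), exactly that many scores are less than or equal to \(\hat{q}_\alpha\). Thus,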
\[
\sum_{k=1}^K m_k(\hat{q}_\alpha) = \lceil (1 - \alpha)(N + K) \rceil.
\]

Define the event \(\mathcal{E}\) as the combined ordering of nonconformity scores within each client, that is,
\[
\mathcal{E} = \left\{ \forall k \in [K], \text{ the nonconformity scores } \{s^k_i\}_{i=1}^{n_k + 1} \text{ are in a fixed order} \right\},
\]
where \(\{s^k_i\}_{i=1}^{n_k + 1}\) are the nonconformity scores for client \(k\), including the test score, sorted in some fixed order.

Conditioned on \(\mathcal{E}\), the number of scores less than or equal to \(\hat{q}_\alpha\), \(m_k(\hat{q}_\alpha)\), is deterministic for each client \(k\).

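Under the exchangeability assumption, conditioned on \(\mathcal{E}\), a test score from client \(k\) is equally likely to occupy any of the \(n_k + 1\) positions in that client's ordering, so it falls at or below \(\hat{q}_\alpha\) with probability \(m_k(\hat{q}_\alpha)/(n_k + 1)\). Weighting each client by \(p_k\), the probability that the test score \(s_{v_{\text{test}}}\) is less than or equal to \(\hat{q}_\alpha\) conditioned on \(\mathcal{E}\) is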
\[
P(s_{v_{\text{test}}} \leq \hat{q}_\alpha \mid \mathcal{E}) = \sum_{k=1}^K p_k \cdot \frac{m_k(\hat{q}_\alpha)}{n_k + 1}.
\]

Since \(p_k / (n_k + 1) = 1/(N + K)\) for every \(k\), the sum simplifies to
\[
P(s_{v_{\text{test}}} \leq \hat{q}_\alpha \mid \mathcal{E}) = \frac{\sum_{k=1}^K m_k(\hat{q}_\alpha)}{N + K} = \frac{\lceil (1 - \alpha)(N + K) \rceil}{N + K} \geq 1 - \alpha.
\]

Similarly, an upper bound follows from \(\lceil x \rceil < x + 1\) together with \(K \geq 1\):
\[
P(s_{v_{\text{test}}} \leq \hat{q}_\alpha \mid \mathcal{E}) = \frac{\lceil (1 - \alpha)(N + K) \rceil}{N + K} < 1 - \alpha + \frac{1}{N + K} \leq 1 - \alpha + \frac{K}{N + K}.
\]

Thus, we have established that the coverage probability satisfies
\[
1 - \alpha \leq P(s_{v_{\text{test}}} \leq \hat{q}_\alpha \mid \mathcal{E}) \leq 1 - \alpha + \frac{K}{N + K}.
\]
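
As an illustrative check of these bounds, consider hypothetical values (chosen only for illustration, not taken from our experiments) \(K = 3\), \((n_1, n_2, n_3) = (40, 35, 20)\), and \(\alpha = 0.1\), so that \(N = 95\) and \(N + K = 98\). Then
\[
\frac{\lceil (1 - \alpha)(N + K) \rceil}{N + K} = \frac{\lceil 88.2 \rceil}{98} = \frac{89}{98} \approx 0.908,
\qquad
1 - \alpha + \frac{K}{N + K} = 0.9 + \frac{3}{98} \approx 0.931,
\]
so the conditional coverage of about \(0.908\) lies between \(0.9\) and \(0.931\), and the slack \(K/(N + K)\) shrinks as the total number of calibration nodes grows.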

Since these bounds hold for every realization of the ordering event \(\mathcal{E}\), averaging over \(\mathcal{E}\) (the law of total probability) shows that the unconditional probability satisfies the same bounds. This completes the proof that the conformal predictor maintains the desired coverage level under the partial exchangeability and permutation invariance assumptions in the graph-structured federated learning setting.


\section{Model Details and Detailed Algorithm}
\label{appendix:algo}

The subsequent sections detail the algorithms employed in our proposed methodology, encompassing node generation, edge formation, and the application of CP to federated node classification tasks. 

\begin{algorithm}[H]
\caption{Federated Graph Learning with Missing Neighbor Generation and Conformal Prediction}
\label{alg:missing-neighbor-generation-conformal-prediction}
\begin{algorithmic}[1]
\Require 
    \( K \): Number of clients \\
    \( \{ (\mathcal{V}^k_{\text{train}}, X^k_{\text{train}}, \mathcal{E}^k) \}_{k=1}^K \): Local datasets \\
    \( M_k \): Number of clusters per client \\
    \( p\% \): Top percentage for edge selection \\
    \( R \): Number of federated rounds \\
    Learning rates, other hyperparameters

\Ensure 
    Prediction sets \( \{ C_\alpha(x) \} \) for test nodes across clients
\Statex

\State \textbf{Step 1: Generate prototype node features}
\For{each client \( k = 1 \) to \( K \) \textbf{in parallel}}
    \State Train VAE \( q_{\phi_k}(z|x) \), \( p_{\theta_k}(x|z) \)
    \State Reconstruct features \( \tilde{x}_v = p_{\theta_k}(q_{\phi_k}(x_v)) \)
    \State Cluster \( \{ \tilde{x}_v \} \) into \( M_k \) centers \( \{ c_m^k \} \)
    \State Send \( \{ c_m^k \} \) to the server
\EndFor
\Statex
\State \textbf{Step 2: Aggregate and broadcast prototypes}
\State Aggregate \( \hat{X} = \bigcup_{k=1}^K \{ c_m^k \} \)
\State Broadcast \( \hat{X} \) to all clients
\Statex
\State \textbf{Step 3: Federated training of VGAE}
\State Initialize global VGAE parameters \( \Theta \)
\For{each round \( r = 1 \) to \( R \)}
    \For{each client \( k = 1 \) to \( K \) \textbf{in parallel}}
        \State Receive \( \Theta \)
        \State Train local VGAE \( q_{\psi_k}(Z|X^k, \mathcal{E}^k) \), \( p_{\varphi_k}(\mathcal{E}^k|Z) \)
        \State Send updated \( \Theta_k \) to server
    \EndFor
    \State Aggregate \( \Theta \leftarrow \frac{1}{K} \sum_{k=1}^K \Theta_k \)
\EndFor
\Statex
\State \textbf{Step 4: Link prediction and graph update}
\For{each client \( k = 1 \) to \( K \)}
    \State Compute edge probabilities \( \hat{P}^k = \text{VGAE}_\Theta(X^k, \mathcal{E}^k) \)
    \State Select top \( p\% \) edges to form new set \( \hat{\mathcal{E}}^k \)
    \State Update \( \mathcal{E}^k \leftarrow \mathcal{E}^k \cup \hat{\mathcal{E}}^k \)
\EndFor
\Statex
\State \textbf{Step 5: Federated GCN training}
\State Initialize global GCN parameters \( \theta \)
\For{each round \( r = 1 \) to \( R \)}
    \For{each client \( k = 1 \) to \( K \) \textbf{in parallel}}
        \State Receive \( \theta \)
        \State Train local GCN on \( (\mathcal{V}^k_{\text{train}}, X^k, \mathcal{E}^k) \)
        \State Send updated \( \theta_k \) to server
    \EndFor
    \State Aggregate \( \theta \leftarrow \sum_{k=1}^K \frac{n_k}{n} \theta_k \)
\EndFor
\Statex
\State \textbf{Step 6: Federated Conformal Prediction}
\For{each client \( k = 1 \) to \( K \)}
    \State Use global GCN to compute predictions \( \mu(x) \) and non-conformity scores
    \State Tune temperature \( T \) based on validation data
    \State Compute local conformal quantile \( q^k \) from calibration scores and share with the server
\EndFor
\State Aggregate quantiles on the server to compute global quantile \( q \)
\State Construct prediction sets \( C_\alpha(x) \) for test data using \( q \)
\end{algorithmic}
\end{algorithm}

\newpage

\subsection{Sparsity Regularization for Node Feature Generation}

In addition to the standard reconstruction and KL-divergence losses in the \texttt{VAE}, we incorporate a sparsity regularization term to encourage the generated node features to reflect the sparse nature of real-world graph data. This is crucial for datasets where most node features are inherently sparse, ensuring that the latent representations and reconstructed features remain close to the original sparse structure.

Given the latent representations \( z \in \mathbb{R}^{d'} \), the sparsity regularization is applied to the encoder activations to control the average activation levels across the latent dimensions. Let \( \hat{\rho} \in \mathbb{R}^{d'} \) denote the mean activation of the latent variables \( z \) over all nodes, defined as:

\[
\hat{\rho}_i = \frac{1}{|\mathcal{V}|} \sum_{v \in \mathcal{V}} z_{v,i}, \quad \forall i \in [1, d'].
\]

We introduce a sparsity target \( \rho \in (0,1) \) that specifies the desired level of activation for each latent variable. The sparsity loss \( \mathcal{L}_{\text{sparse}} \) is then defined as the sum, over latent dimensions, of the Kullback-Leibler divergence between Bernoulli distributions with means \( \rho \) and \( \hat{\rho}_i \), which requires \( \hat{\rho}_i \in (0,1) \):

\[
\mathcal{L}_{\text{sparse}} = \sum_{i=1}^{d'} \left( \rho \log \frac{\rho}{\hat{\rho}_i} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_i} \right).
\]

This loss term encourages the activations to stay close to the sparsity target \( \rho \), penalizing deviations from this target. A scaling factor \( \beta \) is used to adjust the contribution of this term, and the overall loss function for training the \texttt{VAE} becomes:

\[
\mathcal{L} = \lambda_{\text{rec}} \mathcal{L}_{\text{rec}} + \lambda_{\text{kl}} \mathcal{L}_{\text{kl}} + \beta \mathcal{L}_{\text{sparse}},
\]
where \( \mathcal{L}_{\text{rec}} \) is the reconstruction loss, \( \mathcal{L}_{\text{kl}} \) is the KL-divergence loss, and \( \lambda_{\text{rec}} \), \( \lambda_{\text{kl}} \), and \( \beta \) are weights controlling the relative importance of each term.

Incorporating this sparsity regularization helps ensure that the generated node features remain representative of the original sparse input data, improving the quality and fidelity of the reconstructed features in graph-based learning tasks.
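
As a quick numerical illustration of \( \mathcal{L}_{\text{sparse}} \), assume natural logarithms and a hypothetical target \( \rho = 0.05 \) (a value chosen only for illustration). A latent dimension with \( \hat{\rho}_i = 0.05 \) contributes zero to the loss, whereas one with \( \hat{\rho}_i = 0.2 \) contributes
\[
0.05 \log \frac{0.05}{0.2} + 0.95 \log \frac{0.95}{0.8} \approx -0.06931 + 0.16326 \approx 0.0939,
\]
so dimensions whose mean activation drifts away from the target are penalized, and the penalty vanishes only when \( \hat{\rho}_i = \rho \).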

\section{Complexity Analysis}
\label{sec:complexity}

In this section, we provide a complexity analysis of the proposed method, focusing on the communication overhead between the clients and the central server, as well as the computational cost related to the exchange and utilization of generated node features.

\subsection{Prototype Sharing Complexity}
After training the \texttt{VAE}, each client \( k \) identifies \( M_k \) cluster centers, representing the prototype features that will be shared with the central server. The dimensionality of each prototype is \( d \), and the total communication cost of sending the prototype features from client \( k \) to the server is:

\[
\mathcal{O}(M_k \cdot d).
\]

Since there are \( K \) clients in total, the overall communication complexity for sending prototypes to the server is:

\[
\mathcal{O}(K \cdot M_k \cdot d),
\]

where, for simplicity, \( M_k \) is taken to be the same for every client; if it varies, the cost is \( \mathcal{O}(d \sum_{k=1}^K M_k) \).

\subsection{Server Aggregation Complexity}
The central server aggregates the prototype features from all clients, combining them into a global set of features \( \hat{X} = \bigcup_{k=1}^K \{c_m^k\} \). This aggregation step involves concatenating the received prototypes, which has a complexity of:

\[
\mathcal{O}(K \cdot M_k \cdot d).
\]

The server then broadcasts the aggregated prototypes back to all clients. The communication complexity of broadcasting the prototypes from the server to all clients is:

\[
\mathcal{O}(K \cdot M_k \cdot d),
\]

assuming all clients receive the same set of \( K \cdot M_k \) prototypes, i.e., the aggregated set \( \hat{X} \). Thus, the total communication cost for the prototype-sharing phase (sending prototypes to the server and broadcasting them back) is:

\[
\mathcal{O}(2 \cdot K \cdot M_k \cdot d).
\]

\subsection{Federated Training Communication Complexity}
During the federated training of the \texttt{VGAE} model, each client \( k \) sends its local model updates \( \Theta_k \) to the central server. The model parameters \( \Theta_k \) are of size \( |\Theta| \), which is the same across all clients. The communication complexity for each client sending its updated model to the server is:

\[
\mathcal{O}(|\Theta|).
\]

The server aggregates the model updates from all \( K \) clients, which involves summing the model parameters. The complexity of this aggregation step is:

\[
\mathcal{O}(K \cdot |\Theta|).
\]

The server then sends the updated global model back to each client, with a communication complexity of:

\[
\mathcal{O}(K \cdot |\Theta|),
\]

since each client receives the full set of model parameters. Thus, the total communication complexity for one round of federated training is:

\[
\mathcal{O}(2 \cdot K \cdot |\Theta|).
\]

\subsection{Overall Communication Complexity}
The overall communication complexity of the proposed method consists of two main components: (1) prototype sharing and (2) federated training. The total communication complexity is the sum of these two components, which can be expressed as:

\[
\mathcal{O}(2 \cdot K \cdot M_k \cdot d + 2 \cdot K \cdot |\Theta| \cdot R).
\]

This complexity scales linearly with the number of clients \( K \), the number of prototypes per client \( M_k \), the number of federated rounds \( R \), and the size of the model \( |\Theta| \). The communication overhead therefore grows only linearly as the number of clients or the model size increases.
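
For a sense of scale, consider hypothetical values (chosen only for illustration) \( K = 10 \), \( M_k = 5 \), \( d = 500 \), \( |\Theta| = 10^5 \), and \( R = 50 \). The two terms then evaluate to
\[
2 \cdot K \cdot M_k \cdot d = 2 \cdot 10 \cdot 5 \cdot 500 = 5 \times 10^4,
\qquad
2 \cdot K \cdot |\Theta| \cdot R = 2 \cdot 10 \cdot 10^5 \cdot 50 = 10^8,
\]
so under these assumptions the one-off prototype exchange is negligible compared with the repeated model exchange during federated training.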

\section{Dataset Statistics}
\label{appendix:3}

We used the largest connected components of the Cora, CiteSeer, PubMed \citep{yang2016revisiting}, and Amazon Computers \citep{shchur2018pitfalls} datasets, as provided in the PyTorch Geometric package \citep{fey2019fast}. Dataset statistics are given in Table \ref{datasets}.

\begin{table}[h]
\centering
\caption{Dataset statistics.} 
\label{datasets}
\begin{tabular}{lcccc}
\hline
\textbf{Dataset}   & \textbf{\# Nodes} & \textbf{\# Edges} & \textbf{\# Features} & \textbf{\# Labels} \\ \hline
Cora               & 2485             & 10138            & 1433                & 7                  \\
CiteSeer           & 2120             & 7358            & 3703                  & 6                  \\
PubMed               & 19717            & 88648           & 500                & 3                  \\
Computers                 & 13752           & 491722           & 767                & 10                 \\ \hline
\end{tabular}
\end{table}


\section{Comparison of Non-Conformity Scores}
\label{appendix:nonconformity_comparison}

Regularized Adaptive Prediction Sets (RAPS) \citep{angelopoulos2020uncertainty} refine APS by adding a regularization term that penalizes unlikely labels and thereby encourages smaller prediction sets. The score function is defined as
\[
s(x, y) = - \left(\rho(x, y) + u \cdot \pi(x)_y + \nu \max(o(x, y) - k, 0)\right),
\]
where $\nu$ and $k$ are hyperparameters, and $o(x, y)$ denotes the rank of $y$ among the labels sorted by predicted probability.

Least Ambiguous Set-Valued Classifiers (LAC) \citep{sadinle2019least} score each label directly by the probability the classifier assigns to it. The score \( s(x, y) \) is given by
\[
s(x, y) = 1 - [f(x)]_y,
\]
where \( [f(x)]_y \) is the classifier's predicted probability for label \( y \) (the true label when computing calibration scores), so a confident, correct prediction receives a low score.
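
As a small worked example of the LAC score, with hypothetical numbers chosen only for illustration, suppose a three-class classifier outputs \( f(x) = (0.7, 0.2, 0.1) \). The scores for the three candidate labels are
\[
s(x, 1) = 0.3, \qquad s(x, 2) = 0.8, \qquad s(x, 3) = 0.9.
\]
If the calibrated threshold were \( \hat{q} = 0.85 \), the prediction set \( \{ y : s(x, y) \leq \hat{q} \} \) would be \( \{1, 2\} \); a smaller threshold such as \( \hat{q} = 0.5 \) would give the singleton \( \{1\} \).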


Table \ref{non-conformity} compares the CP set sizes when using APS, RAPS, and LAC as the non-conformity scores. While our proposed generative model improves efficiency across all three, LAC yields the smallest prediction sets in most scenarios. However, a known trade-off exists between set size and coverage reliability, as noted in prior work. Our findings confirm this trade-off: Figure \ref{coverage} shows that while LAC produces tighter sets, it increasingly violates the desired $1 - \alpha$ coverage guarantee as the number of clients grows. Because APS and RAPS reliably maintain the target coverage, they are preferable for high-stakes applications.


 
\begin{table}[h]
\centering
\caption{CP set size comparison of the non-conformity scores APS, RAPS, and LAC on the Cora dataset with $K = 3, 5, 10$, and $20$ partitions. Set sizes are reported for $1 - \alpha = 0.95, 0.90$, and $0.80$ confidence levels as the mean over 10 runs, with the corresponding standard deviation.}
\label{non-conformity}
\setlength{\tabcolsep}{6pt}
\scalebox{0.8}{
\begin{tabular}{@{}lccc ccc@{}}
\cmidrule(lr){2-7}
& \multicolumn{1}{c}{\textbf{APS}} & \multicolumn{1}{c}{\textbf{RAPS}} & \multicolumn{1}{c}{\textbf{LAC}} 
& \multicolumn{1}{c}{\textbf{APS}} & \multicolumn{1}{c}{\textbf{RAPS}} & \multicolumn{1}{c}{\textbf{LAC}} \\
\cmidrule(lr){2-4} \cmidrule(lr){5-7}
& \multicolumn{3}{c}{\textbf{$K=3$}} & \multicolumn{3}{c}{\textbf{$K=5$}} \\
\cmidrule(lr){2-4} \cmidrule(lr){5-7}
\texttt{Fed (0.95)}   & 4.31\std{0.02} & 2.57\std{0.02} & \textbf{1.79}\std{0.01}  & 4.94\std{0.02} & 2.97\std{0.01} & \textbf{2.59}\std{0.03}\\     
\texttt{Gen (0.95)}   & 4.25\std{0.02} & 2.22\std{0.01} & \textbf{1.58}\std{0.02}  & 5.09\std{0.02} & 2.82\std{0.02} & \textbf{2.53}\std{0.04}\\
\cmidrule(lr){2-4} \cmidrule(lr){5-7}
\texttt{Fed (0.90)}   & 3.34\std{0.03} & 1.85\std{0.01} & \textbf{1.19}\std{0.01}  & 4.14\std{0.03} & 2.33\std{0.02} & \textbf{1.64}\std{0.02}\\     
\texttt{Gen (0.90)}   & 3.34\std{0.02} & 1.69\std{0.02} & \textbf{1.12}\std{0.01}  & 4.10\std{0.02} & 2.12\std{0.03} & \textbf{1.61}\std{0.01}\\
\cmidrule(lr){2-4} \cmidrule(lr){5-7}
\texttt{Fed (0.80)}   & 2.45\std{0.01} & 1.36\std{0.01} & \textbf{1.01}\std{0.01}  & 2.95\std{0.01} & 1.63\std{0.01} & \textbf{1.04}\std{0.00}\\     
\texttt{Gen (0.80)}   & 2.51\std{0.03} & 1.27\std{0.02} & \textbf{1.00}\std{0.00}  & 2.98\std{0.05} & 1.52\std{0.02} & \textbf{1.04}\std{0.00}\\
\cmidrule(lr){2-7}

& \multicolumn{3}{c}{\textbf{$K=10$}} & \multicolumn{3}{c}{\textbf{$K=20$}} \\
\cmidrule(lr){2-4} \cmidrule(lr){5-7}
\texttt{Fed (0.95)}   & 5.02\std{0.02} & \textbf{3.50}\std{0.01} & 3.82\std{0.02}  & 5.79\std{0.02} & \textbf{5.16}\std{0.02} & 5.64\std{0.01}\\     
\texttt{Gen (0.95)}   & 4.86\std{0.02} & \textbf{3.39}\std{0.03} & 3.39\std{0.04}  & 5.40\std{0.02} & \textbf{4.92}\std{0.05} & 5.06\std{0.03}\\
\cmidrule(lr){2-4} \cmidrule(lr){5-7}
\texttt{Fed (0.90)}   & 4.32\std{0.02} & 2.61\std{0.01} & \textbf{2.06}\std{0.01}  & 4.13\std{0.01} & 3.78\std{0.01} & \textbf{3.37}\std{0.00}\\     
\texttt{Gen (0.90)}   & 3.98\std{0.01} & 2.55\std{0.02} & \textbf{2.00}\std{0.01}  & 3.90\std{0.04} & 3.55\std{0.03} & \textbf{3.05}\std{0.01}\\
\cmidrule(lr){2-4} \cmidrule(lr){5-7}
\texttt{Fed (0.80)}   & 2.93\std{0.03} & 1.79\std{0.01} & \textbf{1.19}\std{0.00}  & 3.17\std{0.03} & 2.92\std{0.01} & \textbf{2.27}\std{0.01}\\     
\texttt{Gen (0.80)}   & 2.92\std{0.02} & 1.73\std{0.01} & \textbf{1.14}\std{0.02}  & 2.88\std{0.03} & 2.50\std{0.01} & \textbf{1.73}\std{0.03}\\

\bottomrule
\end{tabular}
}
\end{table}

\begin{figure}[t]
\centering
\includegraphics[width=120mm]{figures/coverage_aps_raps_lac.png}
\caption{\textbf{Coverage rates with different non-conformity scores for \texttt{Fed} model across varying \( K \) on the Cora dataset.}} \label{coverage}
\end{figure}


\section{Impact of Quantile Averaging Methods}
\label{appendix:impact_of_quantile}

In this study, we evaluated two distributed quantile estimation methods, T-Digest \citep{dunning2021t} and quantile averaging \citep{luo2016quantiles}, with respect to their impact on conformal prediction set sizes. T-Digest is a probabilistic data structure for estimating quantiles in large, distributed datasets. Because its summaries are mergeable, they can be aggregated efficiently across parallel, distributed systems. As shown in Figure \ref{digest}, T-Digest produces larger set sizes than quantile averaging across various configurations of confidence levels and numbers of clients, so quantile averaging yields more efficient prediction sets in our experiments.
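
As a minimal illustration, assume the simplest unweighted form of quantile averaging (an assumption made here for exposition; a weighted rule, for example by calibration-set size, is equally conceivable). If \( K = 3 \) clients report the hypothetical local quantiles \( q^1 = 0.82 \), \( q^2 = 0.90 \), and \( q^3 = 0.86 \), the server computes
\[
q = \frac{1}{K} \sum_{k=1}^{K} q^k = \frac{0.82 + 0.90 + 0.86}{3} = 0.86,
\]
so each client shares only a single scalar rather than its calibration scores.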

\begin{figure}[t]
\centering
\includegraphics[width=100mm]{figures/tdigest.png}
\caption{\textbf{Comparison of T-Digest and quantile averaging methods by confidence level on Cora dataset.}} \label{digest}
\end{figure}

\section{Differential Privacy Analysis}\label{sec:dp}

In our framework, we apply \((\epsilon, \delta)\)-differential privacy (DP) \citep{dwork2014algorithmic} specifically to the node prototypes generated by the VAE before they are shared with the server. The subsequent federated training of the downstream GCN and VGAE models is performed without DP guarantees; therefore, we do not claim end-to-end privacy for the entire system. By the post-processing property of DP, any downstream computation that depends only on the DP-protected prototypes does not weaken their privacy guarantee. This section explores the integration of DP into the node generation process.

Under DP, a randomized mechanism \(\mathcal{M}\) satisfies \((\epsilon, \delta)\)-DP if, for any two neighboring datasets \(D\) and \(D'\) and any set of outputs \(S\), the following holds:

\[
\Pr[\mathcal{M}(D) \in S] \leq e^\epsilon \Pr[\mathcal{M}(D') \in S] + \delta,
\]
where \(\epsilon > 0\) controls the privacy loss and \(\delta\) accounts for the probability of a privacy breach.
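
To make the roles of \(\epsilon\) and \(\delta\) concrete, consider an output set \(S\) with the hypothetical probability \(\Pr[\mathcal{M}(D') \in S] = 0.01\). With \(\epsilon = 1\) and \(\delta = 10^{-5}\), the definition gives
\[
\Pr[\mathcal{M}(D) \in S] \leq e^{1} \cdot 0.01 + 10^{-5} \approx 0.0272,
\]
whereas with \(\epsilon = 10\) the multiplicative factor is \(e^{10} \approx 2.2 \times 10^{4}\), so the bound exceeds 1 and carries no information for this \(S\). Smaller values of \(\epsilon\) therefore correspond to substantially stronger guarantees.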

The node generator model is trained using the Opacus library, which implements differentially private stochastic gradient descent (DP-SGD): per-sample gradients are clipped and Gaussian noise is added, protecting each client's data. The added noise makes it difficult for an adversary to infer individual node features while still enabling useful feature generation. The impact of DP on model performance is explored by varying the privacy budget \(\epsilon\) and fixing \(\delta = 10^{-5}\).

\subsection{Performance with Varying Privacy Budgets}

We evaluate the effectiveness of our node generation method under different privacy budgets by training the \texttt{Gen} method with \(\epsilon\) values ranging from 1 to 25. We analyze the impact of privacy noise on the RAPS non-conformity scores for various \(1-\alpha\) values (ranging from 0.5 to 0.95), comparing the results against the non-private \texttt{Fed} and \texttt{Gen} methods.

Figure \ref{fig:raps_heatmap} presents the observed scores. We note that as the privacy budget decreases (i.e., smaller \(\epsilon\) values), the performance of the \texttt{Gen} method degrades slightly, particularly for larger \(1-\alpha\) values. This degradation is expected due to the additional noise introduced by the DP mechanism, which affects the accuracy of the generated node features. However, even with \(\epsilon = 1\), the degradation remains relatively small, demonstrating that our approach maintains robust performance under strict privacy constraints.



\begin{figure}[htbp]
    \centering
    \includegraphics[width=0.55\linewidth]{figures/raps_heatmap.png}
    \caption{\textbf{Heatmap showing RAPS non-conformity scores for \texttt{Fed} and \texttt{Gen} methods across various \(\epsilon\) and \(1-\alpha\) values on the Cora dataset with 3 clients.}}
    \label{fig:raps_heatmap}
\end{figure}


From the results, we observe that at \(\epsilon=10\), the privacy-preserving \texttt{Gen} model closely approximates the performance of the non-private \texttt{Gen} method across all \(1-\alpha\) values. However, with stricter privacy budgets (e.g., \(\epsilon=1\)), there is a marginal increase in non-conformity scores, indicating a slight decrease in accuracy due to the added noise. Despite this, the model remains competitive even under the strictest privacy constraints.


Our experiments show that incorporating \(\epsilon\)-\(\delta\) differential privacy into the node generation process enables strong privacy guarantees with minimal impact on performance. Even under the strictest privacy settings, the model retains its ability to generate useful node features, as evidenced by the modest increases in RAPS non-conformity scores.


\vfill


