\section{Experiments}\label{sec:experiments}

\subsection{Dataset and Experimental Setup}
\subsubsection{Dataset}\label{sec:dataset}
\textbf{Appendiceal cancer cohort.} This cohort consists of 141 diagnostic WSIs from 92 patients with low-grade appendiceal mucinous neoplasm (LAMN) or mucinous adenocarcinoma (MAC). It is substantially imbalanced (LAMN:MAC = 32:15). Because the slides were sourced from two sites, Wake Forest (WF) and Stanford (SF), the cohort also poses a domain shift challenge. Clinically, MAC is regarded as the more aggressive subtype, with a worse prognosis than LAMN, so it is treated as the positive class when computing AUC, and the reported F1-score corresponds to this positive label.

\textbf{TCGA datasets.} Two public datasets, NSCLC and ESCA, were curated from The Cancer Genome Atlas (TCGA) program~\cite{tomczak2015review}. Both involve distinguishing adenocarcinoma from squamous cell carcinoma: LUAD vs.\ LUSC for NSCLC, and EAC vs.\ ESCC for ESCA. For evaluation, the clinically more aggressive squamous cell carcinoma was treated as the positive class when computing the AUC.

\textbf{BRACS dataset.} The BRACS dataset~\cite{brancati2022bracs} is a public dataset with 526 WSIs of various breast lesions. It contains seven diagnostic categories: normal (N), pathological benign (PB), usual ductal hyperplasia (UDH), atypical ductal hyperplasia (ADH), flat epithelial atypia (FEA), ductal carcinoma in situ (DCIS), and invasive carcinoma (IC). Due to severe class imbalance, both the AUC and the F1-score were computed using macro-averaging across all categories.

Slide counts per diagnostic category are summarized in Table~\ref{tab:dataset_stats} in the Appendix. All tissue segmentation and patch extraction were performed at 20$\times$ magnification.

\subsubsection{Evaluation protocols}\label{sec:evaluation_protocols}
All experiments were conducted on an NVIDIA RTX A6000 GPU with 48\,GB of memory. For feature extraction, we adopted the CLAM~\cite{lu2021data} preprocessing pipeline, which uses HSV-based tissue segmentation and contour-based spatial sampling to identify tissue regions. Patch-level 1024-dimensional feature vectors were extracted using UNI~\cite{chen2024towards} (ViT-L/16 pretrained with DINOv2) with standard ImageNet normalization~\cite{5206848}.

For each cohort, we performed 5-fold cross-validation on the slide-level classification task, with patient-wise splits so that all slides from a patient fall in the same fold. In each split, three folds were used for training, one for validation and one for testing. We report the mean and standard deviation of each metric over the five test folds. Balanced accuracy is reported for the highly imbalanced appendiceal cancer and BRACS cohorts, while overall accuracy is used for the TCGA cohorts.
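
For reference, the two imbalance-aware metrics can be written out explicitly. This is a sketch of the standard definitions, which are assumed here to match the implementation used; the symbols $C$, $\mathrm{TP}_c$, $\mathrm{FP}_c$ and $\mathrm{FN}_c$ are notation introduced only for this illustration. With $C$ classes and per-class counts of true positives, false positives and false negatives,
\[
\mathrm{BAcc} = \frac{1}{C}\sum_{c=1}^{C}\frac{\mathrm{TP}_c}{\mathrm{TP}_c+\mathrm{FN}_c},
\qquad
\mathrm{F1}_{\mathrm{macro}} = \frac{1}{C}\sum_{c=1}^{C}\frac{2\,\mathrm{TP}_c}{2\,\mathrm{TP}_c+\mathrm{FP}_c+\mathrm{FN}_c}.
\]
Both quantities weight every class equally regardless of its slide count. For example, a binary classifier that always predicts the majority class has a recall of 1 on that class and 0 on the other, giving a balanced accuracy of 50\% even when its overall accuracy is high.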

For the domain adaptation analysis on the appendiceal cancer cohort, WF slides formed the source domain and SF slides the target domain. The WF data were partitioned into training, validation and test subsets in a 70/15/15 ratio for pre-training each model. For the target domain, we defined a fixed SF test set of 12 slides (10 LAMN and 2 MAC), which was used for all zero-shot and few-shot evaluations. Zero-shot performance was obtained by applying the WF-pretrained model directly to the SF test set. For few-shot adaptation, we fine-tuned the pretrained model on small labeled SF subsets with 3, 6 and 9 training slides and separate validation sets of 3, 3 and 5 slides, respectively. After adaptation, we report overall accuracy on the SF test set and quantify adaptation efficacy using backward transfer (BWT) and forward transfer (FWT). We use overall accuracy rather than balanced accuracy in this setting to provide a clearer view of adaptation trends. BWT is defined as the WF test accuracy after fine-tuning minus the WF test accuracy before fine-tuning, so large negative values indicate catastrophic forgetting. FWT is defined as the SF test accuracy after fine-tuning minus the zero-shot SF test accuracy, so positive values indicate successful adaptation.
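
The two transfer measures defined above can also be written as formulas. The notation is introduced here for illustration only: $A^{\mathrm{WF}}_{0}$ and $A^{\mathrm{SF}}_{0}$ denote the accuracy of the WF-pretrained model on the WF and SF test sets before any fine-tuning, and $A^{\mathrm{WF}}_{n}$ and $A^{\mathrm{SF}}_{n}$ denote the corresponding accuracies after fine-tuning on $n\in\{3,6,9\}$ SF slides. Then
\[
\mathrm{BWT}_n = A^{\mathrm{WF}}_{n} - A^{\mathrm{WF}}_{0},
\qquad
\mathrm{FWT}_n = A^{\mathrm{SF}}_{n} - A^{\mathrm{SF}}_{0}.
\]
With this sign convention, $\mathrm{BWT}_n<0$ means that fine-tuning lowered source accuracy, $\mathrm{BWT}_n=0$ means that it left source accuracy unchanged, and $\mathrm{FWT}_n>0$ means that target accuracy improved over the zero-shot baseline.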

See Appendix~\ref{sec:appendixA} for more implementation details.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\input{sec/tables/table1-4}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Comparison with state-of-the-art methods}
We compared our method with nine strong MIL baselines that cover diverse design paradigms: attention-based pooling MIL (CLAM-SB and CLAM-MB~\cite{lu2021data}), transformer-based MIL (TransMIL~\cite{shao2021transmil}), dual-stream MIL (DSMIL~\cite{li2021dual}), distillation-based MIL (DTFD-MIL~\cite{zhang2022dtfd}), graph-based MIL (WiKG~\cite{li2024dynamic} and PatchGCN~\cite{chen2021whole}), and hard-instance-mining MIL (MHIM-DSMIL and MHIM-TransMIL~\cite{tang2023multiple,tang2026multiple}). Table~\ref{tab:main_results}
reports mean and standard deviation over five folds for all metrics on the four datasets.

On the appendiceal cancer cohort, ResGAT achieves the highest balanced accuracy at 92.56$\pm$6.36\%, outperforming the best baseline, CLAM-SB, by roughly 2.5 percentage points and yielding the lowest standard deviation across folds. It also attains the highest F1-score and a high AUC, indicating reliable detection of the clinically more aggressive MAC subtype. On TCGA-NSCLC and TCGA-ESCA, CLAM-SB attains the highest mean accuracy, while ResGAT remains competitive: its accuracy is only 0.21 and 0.02 percentage points below CLAM-SB on TCGA-NSCLC and TCGA-ESCA, respectively. Notably, ResGAT's low standard deviations on the TCGA cohorts indicate stable performance across folds. On BRACS, a challenging seven-class fine-grained classification task, ResGAT achieves the highest balanced accuracy and AUC among all methods, while MHIM-DSMIL obtains the highest macro-F1. The low balanced accuracy of all methods reflects the inherent difficulty of fine-grained breast lesion subtyping. Overall, these results indicate that ResGAT performs well on the class-imbalanced and label-noisy appendiceal cancer cohort and the BRACS dataset, while remaining comparable to competitive MIL baselines on the other datasets.

The results also highlight the complementary strengths of other MIL approaches. On the two TCGA cohorts, DTFD-MIL obtains the second-highest accuracies and AUCs, with CLAM-MB generally close behind. The MHIM variants (MHIM-DSMIL and MHIM-TransMIL) consistently improve over their backbones, which supports the effectiveness of the hard-instance mining strategy.

\subsection{Domain Adaptation Analysis}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\input{sec/tables/table2}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
In this experiment, we evaluated cross-site robustness on the appendiceal cancer cohort, where WF and SF correspond to different acquisition sites (see Section~\ref{sec:dataset} for details). Such cross-site settings often introduce substantial distribution shift due to differences in scanners, staining protocols and local practice, and models trained on a single site can experience a marked performance drop when deployed elsewhere~\cite{liu2025hasd, pocevivciute2024detecting}. We therefore used this scenario to assess the generalization ability of each method, which is a critical consideration for realistic clinical deployment. We first evaluated zero-shot performance, where a model trained on the source site is applied directly to the target site. We then evaluated few-shot adaptation, where only a small number of labeled SF slides are available for fine-tuning the source-trained model (see Section~\ref{sec:evaluation_protocols} for details).

\subsubsection{Cross-domain Generalization}
Table~\ref{tab:domain-adaptation} compares our method with the same nine MIL baselines. While most MIL baselines achieve reasonably high accuracy on the WF source test set, their zero-shot performance on the SF target set is highly variable and often subtype-imbalanced. Several baselines, including WiKG, TransMIL and the CLAM variants, fail to correctly predict MAC samples during cross-site transfer, indicating a strong bias toward the majority LAMN subtype. In comparison, PatchGCN and DTFD-MIL achieve strong zero-shot performance on the SF test set, with per-class accuracies exceeding 90\%, suggesting robust initial cross-site generalization. ResGAT achieves the second-highest source-domain accuracy on the WF test set and provides competitive zero-shot accuracy on the SF test set, establishing a solid foundation for further adaptation.

\subsubsection{Few-shot Adaptation}

In this setting, we analyzed how pre-trained models adapt to target data when fine-tuned on a small number of labeled SF slides. ResGAT adapts most efficiently, reaching 100\% overall accuracy on the SF test set in the 3-shot setting and maintaining this performance in the 6-shot and 9-shot settings. Its already high source test performance is unchanged across all settings (BWT $=0$), showing that adaptation does not induce forgetting on the source domain. This result suggests that ResGAT can be adapted to a new site with only a small number of labeled slides, which is especially valuable in rare-disease scenarios where annotation is costly and limited.

PatchGCN maintains its perfect accuracy throughout all few-shot settings, suggesting that its graph-based representation captures site-invariant tissue structures. DTFD-MIL and MHIM-TransMIL show steady improvements on SF test accuracy as more target slides become available, alongside positive forward transfer (FWT) and BWT, indicating stable learning and knowledge retention under additional target supervision. By contrast, CLAM-SB and CLAM-MB, despite their strong performance on general classification tasks, show little change in SF test accuracy across all few-shot settings, suggesting that their architectures are less responsive to limited target supervision during cross-site adaptation.

\subsection{Ablation Study}
\subsubsection{Effectiveness of Proposed Edge Construction}\label{sec:ablation1}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\input{sec/tables/table5}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
To assess the contribution of the proposed graph construction, we compared several topology variants: Feature kNN (edges based on feature similarity), Spatial kNN (edges based on spatial proximity), Hybrid (our method, with two settings of $d_{spa}$) and Node-permuted (hybrid adjacency with features randomly reassigned to nodes). For all graph variants, we used $k=6$; for the hybrid case, we varied the $d_{spa}$ hyperparameter while keeping all other settings fixed (see Section~\ref{sec:graph-construction} for details). As shown in Table~\ref{tab:graph-ablation}, the hybrid graph consistently provides the strongest overall performance across datasets, indicating that combining spatial proximity and feature similarity yields a more effective graph topology than using either criterion alone. Notably, the node-permuted variant remains competitive. This suggests that the adjacency structure itself provides structural regularization that stabilizes representation learning and mitigates overfitting.
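
The node-permuted control can be stated compactly. As a sketch in notation introduced only for this illustration, let $X$ be the $N\times D$ matrix of patch features of a slide, $A$ the $N\times N$ hybrid adjacency matrix, and $P$ an $N\times N$ permutation matrix, assumed here to be drawn uniformly at random:
\[
(X,\,A)\;\longmapsto\;(PX,\,A).
\]
The edge set, and hence the degree of every node position, is unchanged, and the slide still contains the same multiset of features. What changes is that the feature attached to a node no longer corresponds to the patch from which that node's edges were built. Performance retained by this variant is therefore unlikely to stem from agreement between a node's features and its neighborhood.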

We further investigated the robustness of the hybrid topology through a sensitivity analysis of its hyperparameters. Specifically, we performed a grid search over the number of spatial neighbors ($d_{spa}\in\{15, 24, 36, 48, 60\}$), the number of feature neighbors ($d_{feat} \in\{35, 50, 65, 90, 105\}$), and the maximum neighborhood size ($k\in\{6, 8\}$). The resulting heatmap in Appendix Fig.~\ref{fig:hyperparam} visualizes the evaluation metrics across all four datasets. ResGAT remains stable over a broad range of parameter combinations. Although the best parameter choices vary by dataset, the general configuration consistently provides strong performance across all datasets. Overall, these results indicate that the proposed edge construction is effective and robust across diverse datasets.

\subsubsection{Effectiveness of Residual Block Design}\label{sec:ablation2}

We conducted a set of experiments to evaluate key architectural design choices, including the dual-branch architecture, normalization strategy, layer depth, and graph convolution type. First, we compared the full ResGAT model against two variants: one ablating the linear branch and another removing all inter-node edges, which reduces the model to a node-wise MLP. As shown in Table~\ref{tab:ablation_linear_branch}, ablating the linear branch leads to a consistent drop in performance across datasets, indicating that direct patch-level feature propagation meaningfully complements graph aggregation. Removing the inter-node edges also reduces accuracy and balanced accuracy. Together, these findings support the dual-branch design: preserving patch-specific features and aggregating topological context are both important for forming effective slide-level representations.

Table~\ref{tab:norm-ablation} shows that GraphNorm provides the most favorable performance within ResGAT compared to LayerNorm and InstanceNorm. Specifically, it outperforms both alternatives on the appendiceal cancer and BRACS datasets, where its graph-level normalization statistics and learnable shift parameter offer a more expressive normalization strategy than the per-node or per-feature counterparts.
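
For context, the normalization discussed above can be written out. The following restates GraphNorm as it is commonly defined in the literature, and it is assumed here that ResGAT uses this standard form; the symbols are introduced for illustration. For a graph with $N$ nodes, let $h_{i,j}$ be the $j$-th feature of node $i$:
\[
\mathrm{GraphNorm}(h_{i,j}) = \gamma_j\,\frac{h_{i,j}-\alpha_j\,\mu_j}{\hat{\sigma}_j}+\beta_j,
\qquad
\mu_j=\frac{1}{N}\sum_{i=1}^{N}h_{i,j},
\qquad
\hat{\sigma}_j^{2}=\frac{1}{N}\sum_{i=1}^{N}\bigl(h_{i,j}-\alpha_j\,\mu_j\bigr)^{2}.
\]
Here $\gamma_j$ and $\beta_j$ are the usual learnable affine parameters, and $\alpha_j$ is the learnable shift that controls how much of the graph-level mean is subtracted. Setting $\alpha_j=1$ recovers plain standardization of each feature over the nodes of a graph, while smaller values of $\alpha_j$ retain part of the mean, which may itself carry slide-level information. Implementations typically add a small constant to the denominator for numerical stability.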

Additionally, Appendix Table~\ref{tab:ablation_layer_conv_all} summarizes the impact of different layer depths and graph convolution types. For the layer depth study, the 2-layer variant removes the intermediate residual block with appropriate dimension alignment, while the 4-layer variant adds an additional block that preserves the output dimension of the third layer. The results show that a 3-layer configuration provides the best trade-off between performance and computational cost: 2-layer models generally underperform due to limited receptive fields, whereas increasing the depth to 4 layers incurs higher computational cost without yielding consistent improvements. Regarding the graph convolution type, we compared GAT against GCN, GIN, and GraphSAGE. While GIN performs comparably in specific instances, we adopt GAT as the default because it achieves the highest performance across the majority of metrics and remains the most stable choice across datasets.

We further evaluated computational efficiency by comparing the throughput of ResGAT against two other graph-based methods, WiKG and PatchGCN, under the same training protocol. As detailed in Appendix Table~\ref{tab:compute_efficiency}, ResGAT achieves a throughput comparable to PatchGCN and WiKG, indicating that the multi-layer residual block design improves performance without incurring significant computational overhead.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\input{sec/tables/table6}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\input{sec/tables/table4}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{Qualitative Results}\label{sec:qualitative}
We applied graph-adapted Grad-CAM++ (Section~\ref{sec:gradcam}) to visualize WSI heatmaps. Appendix Fig.~\ref{fig:2representatives} illustrates three representative MAC cases, where the primary heatmaps in the first row were computed as a confidence-weighted average of the top cross-validation models. The regions with high saliency scores are outlined in yellow. While heatmaps from individual folds exhibit spatial variation, the top-performing models show consensus in the regions they highlight, indicating that the network captures stable diagnostic patterns. Clinical review by our pathologist confirmed that the high-scoring patches predominantly correspond to tumor and stromal tissue, where tumor morphology and its spatial relationship with the stroma inform subtyping. This indicates that ResGAT's predictions are informed by histologically meaningful features.
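
The aggregation of per-fold heatmaps can be sketched as follows. The exact weighting is not specified in this section, so the form below is an assumption, and all symbols are introduced for illustration. Let $S_m(p)$ be the saliency that model $m$ assigns to patch $p$, let $w_m\ge 0$ be the confidence of model $m$ on the slide (for example, its predicted probability for the MAC class), and let $\mathcal{M}$ be the set of selected top-performing cross-validation models:
\[
\bar{S}(p)=\frac{\sum_{m\in\mathcal{M}} w_m\,S_m(p)}{\sum_{m\in\mathcal{M}} w_m}.
\]
Normalizing by the total weight keeps $\bar{S}$ on the same scale as the individual maps, and models that are uncertain about a slide contribute less to the highlighted regions.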
