\section{Introduction}
\label{sec:intro}
As histopathology digitization becomes routine, incorporating computational models into diagnostic workflows is increasingly feasible~\cite{hanna2019implementation,kumar2020whole,zhang2025patches}. These models provide slide-level classification results together with interpretable justifications, promoting consistent decisions and transparent verification~\cite{tizhoosh2018artificial,yilmaz2024advancing}. This is particularly valuable for rare diseases, where expert diagnosticians are scarce. However, a fundamental challenge lies in the gigapixel scale of whole-slide images (WSIs), which prevents them from being processed as a single image. In practice, the standard approach tiles the tissue regions into thousands of patches and formulates slide classification as a Multiple Instance Learning (MIL) problem, in which each slide is a bag of patch instances and only the slide-level label is available.

MIL for WSI classification has evolved from simple feature pooling to sophisticated context modeling. Initial frameworks adopted static aggregation strategies, such as max-pooling~\cite{campanella2019clinical} and mean-pooling. While computationally efficient, these methods often lose critical contextual information, either by retaining only the most extreme instance response or by diluting sparse signals through averaging. The introduction of Attention-based MIL (ABMIL)~\cite{ilse2018attention} marked a pivotal advancement by learning trainable attention weights that determine each instance's contribution to the slide-level representation. Subsequent research has sought to address overfitting and attention concentration through two main strategies: pseudo-bag augmentation with feature distillation, as in DTFD-MIL~\cite{zhang2022dtfd}, and attention-challenging frameworks such as ACMIL~\cite{zhang2024attention} and MHIM~\cite{tang2023multiple}, which suppress high-confidence instances to encourage the discovery of comprehensive diagnostic patterns. Despite these improvements, such attention mechanisms often treat instances as independent and identically distributed (i.i.d.). To capture inter-instance correlations, recent sequence-based works such as TransMIL~\cite{shao2021transmil} and Mamba-based architectures~\cite{yang2024mambamil} leverage self-attention and selective scan mechanisms, respectively, to model long-range dependencies, marking a paradigm shift towards correlated feature learning.

In parallel with these sequence-based advancements, Graph Neural Networks (GNNs) have emerged as a distinct paradigm focused on explicitly encoding the structural topology of the tissue~\cite{brussee2025graph}. By representing patches as nodes and their interactions as edges, these methods avoid flattening the spatial structure into a sequence. Early implementations employed $k$-nearest neighbor ($k$NN) algorithms to construct spatial graphs, demonstrating that explicitly modeling local neighborhoods enhances diagnostic accuracy~\cite{chen2021whole,zheng2022graph}. Subsequent research has explored more intricate graph constructions, including hierarchical formulations for multi-resolution reasoning~\cite{hou2022h} and heterogeneous graphs that distinguish between different tissue components~\cite{chan2023histopathology}. However, the ``over-smoothing'' phenomenon~\cite{chen2020simple} remains a challenge for graph-based MIL approaches: stacking multiple message-passing layers homogenizes node representations, eroding the discriminative power essential for classification. This degradation poses an obstacle in realistic clinical settings, which are characterized by extreme heterogeneity in tissue scale. In such diverse scenarios, applying standard readout functions to homogenized features yields inconsistent diagnostic profiles across varying graph sizes, which harms the reliability required for clinical deployment.

Motivated by these challenges, we propose ResGAT, a weakly supervised MIL framework for WSI subtype classification. Each WSI is represented as a hybrid $k$NN patch graph whose nodes are initialized with extracted patch features and connected via spatial and feature proximity. ResGAT processes the patch graph with stacked residual graph attention blocks, where each block features a dual-branch design combining multi-head graph attention with a parallel linear projection. This design preserves patch-specific information while adaptively aggregating contextual information, yielding representations that support effective slide-level prediction. In comparative evaluations against representative MIL baselines, ResGAT achieves superior classification performance on both a rare, class-imbalanced appendiceal cancer cohort and the multi-class BRACS dataset, and remains competitive on two public TCGA datasets. On the appendiceal cancer cohort, we also introduce a benchmarking protocol to assess cross-site generalization and few-shot adaptation, demonstrating that ResGAT maintains strong performance when labeled data are limited in new domains. An ablation study examines the contribution of each core component of ResGAT. Furthermore, the framework supports qualitative interpretation through heatmaps that highlight prediction-relevant regions.