\input{results/combine}
We evaluate our method on two challenging clinical tasks: cancer stage classification and patient survival prediction. Our experimental framework, including the datasets, task definitions, and baseline models used for comparison, is detailed below.
\subsection{Experimental Setup}
\noindent\textbf{Datasets and Tasks.} We evaluate our method on two TCGA cohorts: Breast Invasive Carcinoma (BRCA, $N=1{,}101$)~\cite{tcga-brca} and Uterine Corpus Endometrial Carcinoma (UCEC, $N=539$)~\cite{tcga-ucec}.
We benchmark performance on two clinical tasks: (1) Tumor Stage Classification, determining pathological stage directly from WSIs, and (2) Patient Survival Prediction, identifying risk groups for prognostic assessment.

The first task, cancer stage classification, involves predicting the pathological stage of the tumor directly from the WSI. Cancer staging is a critical component of clinical oncology, as it describes the extent of cancer progression
and acts as a basis to determine treatment plans. The stage is typically described on a scale from I to IV, with higher stages indicating more advanced disease.
Since the task is a multi-class problem, we use cross-entropy loss for all models.

The second task, survival prediction, stratifies patients into risk groups based on their
predicted prognosis. Accurate survival prediction is essential for personalizing treatment,
managing patient care, and identifying high-risk individuals who may benefit from more
aggressive therapies.
For training, we use the negative log-likelihood survival loss~\cite{zadeh2020bias}.
We include the full details on all hyperparameters in \appendixref{app:hyperparameter}.
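
For reference, a common discrete-time form of the negative log-likelihood survival loss can be sketched as follows. This is an illustrative formulation and not necessarily the exact variant used in our experiments. The notation is introduced here as an assumption: follow-up time is discretized into $K$ intervals, the model outputs hazards $h_k \in (0,1)$, $S_k = \prod_{j \le k} (1 - h_j)$ is the discrete survival function with $S_0 = 1$, $y$ is the interval containing the observed time, and $c \in \{0,1\}$ indicates censoring ($c = 1$ if censored).
\[
\mathcal{L}_{\text{NLL}} = -\,c\,\log S_{y} \;-\; (1 - c)\,\bigl[\log S_{y-1} + \log h_{y}\bigr]
\]
Under this reading, a censored patient contributes only the likelihood of surviving through interval $y$, whereas an uncensored patient contributes the likelihood of surviving up to interval $y-1$ and experiencing the event in interval $y$.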


\noindent\textbf{Baselines.} We compare our framework against two categories of methods. The first category consists of efficient baselines that use a comparable amount of data and have a similar model size: representative MIL methods (DeepSets~\cite{zaheer2017deepSets}, ABMIL~\cite{ilse2018abmil}) and graph-based approaches (GraphTransformer~\cite{zheng2022GraphTransformer}, DM-GNN~\cite{wang2024dmgnn}).
To represent standard deep-learning-based histopathology workflows, we use established pre-trained encoders for these baselines: GraphTransformer employs its original patch encoder, while DeepSets, ABMIL, and DM-GNN use the widely adopted ResNet-50 features from CLAM~\cite{lu2021data}, following the protocol in~\cite{wang2024dmgnn}.
The second comparison category features UNI2-h~\cite{chen2024FoundationWSI}, a large-scale vision foundation model pre-trained on more than 100 million pathology patches. By leveraging massive data and computational resources to learn highly generalizable morphological representations, UNI2-h provides an upper-bound benchmark that contextualizes the performance of our method.

\noindent\textbf{Evaluation Protocol.}
For all methods, we run five independent hyperparameter sweeps on different training data splits, select a model based on validation set performance, and evaluate it on a separate test set (see \appendixref{app:experiment} for more details). All results report the mean and standard deviation across the five independent folds, with site-level splitting to minimize batch effects~\cite{howard2021splits}.

For staging, we report Area Under the Receiver Operating Characteristic Curve (AUC), Balanced Accuracy, and macro-averaged F1-score ($\text{F}1_\text{m}$) to account for class imbalance. For survival prediction, we report the Concordance Index (c-index)~\cite{harrell1982cindex}.
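
For completeness, the standard definitions of these metrics can be written as follows. The notation is introduced here for illustration only: $C$ is the number of stage classes, $\mathrm{TP}_k$, $\mathrm{FP}_k$, and $\mathrm{FN}_k$ are the true positives, false positives, and false negatives for class $k$, $t_i$ is the observed time of patient $i$, $\delta_i = 1$ if the event was observed (uncensored), and $\hat{r}_i$ is the predicted risk. Tie handling is omitted for brevity.
\[
\text{BAcc} = \frac{1}{C}\sum_{k=1}^{C} \frac{\mathrm{TP}_k}{\mathrm{TP}_k + \mathrm{FN}_k},
\qquad
\text{F}1_\text{m} = \frac{1}{C}\sum_{k=1}^{C} \frac{2\,\mathrm{TP}_k}{2\,\mathrm{TP}_k + \mathrm{FP}_k + \mathrm{FN}_k}
\]
\[
\text{c-index} = \frac{\sum_{i \neq j} \delta_i\,\mathbf{1}[t_i < t_j]\,\mathbf{1}[\hat{r}_i > \hat{r}_j]}{\sum_{i \neq j} \delta_i\,\mathbf{1}[t_i < t_j]}
\]
Both staging metrics average per-class scores with equal weight, so rare stages count as much as frequent ones. The c-index is the fraction of comparable patient pairs whose predicted risks are ordered consistently with their observed times, with 0.5 corresponding to random ordering.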

\subsection{Performance Analysis}
\tableref{tab:results-combined} summarizes the performance of our proposed framework across cohorts and tasks. Our method establishes a new state-of-the-art among resource-efficient baselines, outperforming existing MIL and graph-based approaches on the majority of primary metrics across both datasets.

\noindent\textbf{Comparison with Efficient Baselines.} In the stage classification task on TCGA-BRCA, our method achieves the highest AUC and $\text{F}1_\text{m}$ scores.
We report an AUC of 67.2, surpassing the strongest baseline (GraphTransformer) by 3.9 points.
Notably, our $\text{F}1_\text{m}$-score of 28.0 represents a relative improvement of over 20\% compared to the runner-up (DM-GNN: 23.2).
This advantage extends to survival prediction, where we outperform all comparable baselines (c-index: 62.9 vs 61.7).
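
The relative $\text{F}1_\text{m}$ improvement quoted above follows directly from the two reported scores:
\[
\frac{28.0 - 23.2}{23.2} = \frac{4.8}{23.2} \approx 0.207,
\]
that is, a relative gain of roughly 20.7\% over DM-GNN.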

This trend continues on the smaller TCGA-UCEC dataset, indicating that our approach is robust and generalizes well.
It achieves the highest AUC (56.4) and leads in survival prediction (\mbox{c-index:} 60.0 vs 58.3). While the margins are tighter on this smaller cohort, our method performs consistently well and is a close second in the $\text{F}1_\text{m}$-score (19.5 vs 20.6 for \mbox{DM-GNN}).

We attribute this advantage to our fundamentally different approach to representing tissue. Standard MIL models discard spatial context, while patch-graph methods often impose artificial grids that fragment biological entities. By aligning nodes with natural tissue boundaries and enriching them with interpretable clinical features, our model captures subtle, stage-determining morphological signals that grid-constrained topologies miss.

\subsection{Efficiency vs. Foundation Models}
While foundation models rely on scale, our approach demonstrates that an intelligent representation design can achieve competitive performance with a fraction of the resources.
On TCGA-BRCA, our model marginally exceeds the performance of UNI2-h in survival prediction (c-index: 62.9 vs 62.1) and remains statistically comparable in staging. This parity is achieved despite massive disparities in resource usage (see \appendixref{app:calculation} for details):
\begin{itemize}
    \item \textbf{Data scale}: Even when the training of the models used for feature extraction is included, our method consumed 300$\times$ less data than UNI2-h.
    \item \textbf{Feature interpretability}: Beyond efficiency, our model is interpretable by design. Predictions can be traced back to specific, biologically-grounded regions and a curated set of clinically-motivated features. This stands in stark contrast to the black-box features used by most other models and UNI2-h.
\end{itemize}


\subsection{Ablation Studies}
\input{results/ablation_all_nouni}
To validate the contribution of each component in our proposed pipeline, we conducted a series of ablation studies on the TCGA-BRCA dataset. We investigate four critical design dimensions: impact of architecture vs. features, feature redundancy pruning ($\xi$), graph topology construction via region merging ($\tau$), and the synergy between different feature modalities. The results for the architecture vs feature ablation are shown in \tableref{tab:ablationrev1}, while the other results are summarized in \tableref{tab:ablation}.

\noindent\textbf{Architecture vs features:} To disentangle the performance gains attributed to our graph-based method from those driven by the interpretable feature set, we evaluated all combinations of architectures and features (see \tableref{tab:ablationrev1}). We compared our proposed framework (Graph + Interpretable) against the standard MIL baseline (ABMIL) and the widely used, learned patch embeddings from a pre-trained ResNet-50 (CLAM~\cite{lu2021clam}). Three key insights emerge from this analysis.

\noindent First, we observe that our method relies on both the graph architecture and the feature set, showing the strongest performance in this configuration. When using interpretable features within a standard ABMIL framework, performance drops significantly on both BRCA (AUC: 67.2 to 58.9; c-index: 62.9 to 51.0) and UCEC (AUC: 56.4 to 52.1; c-index: 60.0 to 51.3). This indicates that local, interpretable statistics (such as nuclear density or texture) lose their predictive value for these challenging tasks when aggregated globally without spatial context.


\noindent Second, the interpretable feature set can replace learned embeddings in a MIL setting for stage prediction: performance is comparable, while interpretability is greatly improved. However, this does not hold for survival prediction, where interpretable features in a MIL setting perform significantly worse than ResNet-50 embeddings on both datasets.


\noindent Third, using learned features in combination with a graph instead of a MIL architecture neither hurts nor boosts performance significantly, with the exception of survival prediction on BRCA, where MIL clearly performs better (ABMIL: 61.7 vs Graph: 51.4).

\input{results/ablation}

\noindent\textbf{Graph Construction ($\tau$):} The region merging threshold $\tau$ exhibits a trade-off between abstraction and detail.
Lower values of $\tau$ merge larger, more heterogeneous regions, while higher values preserve more fine-grained details.
We observe that merging regions too aggressively ($\tau=0.5$) leads to significant information loss.
Conversely, preserving every superpixel ($\tau=1.0$) creates an overly dense graph, increasing computational complexity and introducing high-frequency noise.
A moderate $\tau=0.95$ balances the simplification of homogeneous areas with the preservation of heterogeneous details, yielding strong performance across all metrics.
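
As a notational sketch of this trade-off, the merging rule can be read as follows. The similarity function $s$ and the exact form of the criterion (including whether the comparison is strict) are assumptions made here for illustration and are not a specification of our implementation. Two adjacent regions $r_i$ and $r_j$ are merged whenever
\[
s(r_i, r_j) \ge \tau, \qquad s(r_i, r_j) \in [0, 1].
\]
Under this reading, $\tau = 1.0$ merges only regions with identical descriptors, so essentially every superpixel is preserved, whereas $\tau = 0.5$ also merges regions that are only moderately similar, producing larger and more heterogeneous nodes.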

\noindent\textbf{Feature Pruning ($\xi$):} The correlation threshold $\xi$ governs the trade-off between feature dimensionality and information retention.
A high correlation threshold ($\xi=0.99$) effectively removes redundant features without losing information, whereas retaining all features ($\xi=1.0$) leads to slight overfitting.
Conversely, setting the threshold too low ($\xi=0.95$) overly simplifies the feature space, discarding subtle but discriminative signals.
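
The pruning rule can be sketched in the same spirit. The use of the Pearson correlation $\rho$ and the greedy formulation below are assumptions introduced for illustration. A feature $f_j$ is discarded if there is an already retained feature $f_i$ such that
\[
\lvert \rho(f_i, f_j) \rvert > \xi,
\qquad
\rho(f_i, f_j) = \frac{\mathrm{Cov}(f_i, f_j)}{\sigma_{f_i}\,\sigma_{f_j}}.
\]
With $\xi = 1.0$ no pair can exceed the threshold, so all features are retained, while lowering $\xi$ removes progressively less redundant features.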

\noindent\textbf{Feature Groups:} The full combination of all three modalities yields the best overall performance (65.4 AUC, 25.3 $\text{F}1_\text{m}$), supporting the hypothesis that a holistic view of the tissue microenvironment is superior to any single descriptor.
While nuclear features are the most discriminative individual group (62.9 AUC), adding texture and morphology features further improves overall performance and robustness to class imbalance.

\noindent Since running the pipeline with several different similarity thresholds is comparatively cheap, we recommend such a sweep whenever the framework is applied to a new dataset.

\subsection{Qualitative Analysis and Interpretability}
\begin{figure*}[t]
\includegraphics[width=\textwidth]{figures/explainability-1.pdf}
\caption{Explanations for two TCGA-BRCA Stage 2 samples. Red overlays indicate influential regions (via Integrated Gradients). We also list the top-attributed interpretable features compared to dataset statistics to highlight biological drivers.}
\label{fig:explainability}
\end{figure*}


The goal of computational pathology is not just predictions, but actionable insights.
A key motivation for moving beyond patch-based black boxes is the need for trustworthy and clinically relevant explanations.
Our framework delivers on this promise.

Leveraging the attribution scores derived via Integrated Gradients (see \sectionref{sec:explainability}), we attribute predictions to specific tissue regions and feature sets (see \figureref{fig:explainability}).
We note that Integrated Gradients, being an additive attribution method, cannot explicitly highlight interactions between distinct regions or features. However, since our model captures interactions between regions and features (via its graph layers), the final attribution score represents a node's contribution given the graph structure and the feature interactions learned by the model.
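
For reference, the standard definition of Integrated Gradients makes this additivity explicit. The symbols are introduced here for illustration: $F$ is the model output for the predicted class, $x$ the input features, and $x'$ a baseline input whose choice is not specified in this section.
\[
\mathrm{IG}_i(x) = (x_i - x'_i) \int_{0}^{1} \frac{\partial F\bigl(x' + \alpha\,(x - x')\bigr)}{\partial x_i}\, d\alpha,
\qquad
\sum_i \mathrm{IG}_i(x) = F(x) - F(x').
\]
The second identity (completeness) states that the per-feature attributions sum to the change in model output relative to the baseline. Each score is therefore a single additive share, and no separate term is assigned to a pair of regions or features.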

To validate the clinical utility of these explanations, we conducted a preliminary qualitative review with a pathology resident. The expert confirmed that the regions identified by the model as most important align closely with diagnostically relevant tissue structures. Specifically, the model frequently highlighted direct tumor tissue and inflammatory infiltrates at tumor boundaries, both of which are critical for staging and prognosis. Furthermore, the expert noted that the model’s tendency to focus on a limited number of small, informative regions mirrors actual clinical workflows, where diagnosis is often driven by distinct morphological regions rather than a uniform assessment of the whole slide.

The attribution to specific interpretable features also received positive validation. The expert noted that explicitly using nuclear statistics, such as the count and size deviations of specific subtypes (e.g., neoplastic vs. inflammatory), provides highly informative, clinically grounded evidence. For instance, in \figureref{fig:explainability} (left), identifying regions with high density of necrotic nuclei as Stage 2 drivers aligns with common grading criteria.

However, this expert review also underscored the necessity of critical oversight. In \figureref{fig:explainability} (right), while the model correctly flagged ``unusual color'' (redness) as a statistical deviation contributing to the prediction, the pathologist cautioned that such color variations can sometimes stem from staining artifacts rather than biological signals. This distinction highlights the value of a transparent feature set: by explicitly naming ``color'' as the driver, the model allows the expert to accept valid morphological signals (like nuclear size) while potentially discounting artifacts, a level of scrutiny impossible with black-box embeddings.

As shown in these examples, we further contextualize these explanations by comparing the most attributed features to statistics derived from the training dataset. This allows experts to immediately discern whether a prediction is driven by a deviation from common patterns or the presence of a specific, rare indicator, thereby building confidence in the model's findings.
We provide additional qualitative examples in \appendixref{app:examples}.
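
One simple way to express the comparison between an attributed feature and the training-set statistics is a standardized deviation. This is offered as an assumption for illustration and is not necessarily the exact statistic displayed in \figureref{fig:explainability}. For a feature $f$ with value $x_f$ in the highlighted region and training-set mean $\mu_f$ and standard deviation $\sigma_f$,
\[
z_f = \frac{x_f - \mu_f}{\sigma_f}.
\]
A large $\lvert z_f \rvert$ marks the feature value as unusual relative to the training data, whereas a value near zero suggests that the attribution reflects a common pattern rather than a rare indicator.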