Our framework transforms WSIs into compact, interpretable graphs, suitable for clinical tasks and explainable predictions.
This transformation addresses the core challenges of computational pathology: the immense scale of the data, the need for both local features and global connections, and the benefit of explainable models in a clinical context.

The proposed pipeline (see \figureref{fig:method}) is a multi-stage process that moves from a low-level pixel representation to a high-level graph suitable for clinical prediction tasks.

The pipeline consists of four main steps. First, the WSI is segmented into a large set of small, biologically aligned regions using superpixels.
Second, these fine-grained regions are adaptively merged based on their semantic similarity, creating a coarsened graph that reflects the macroscopic tissue organization.
Third, each node in this coarsened graph is enriched with a comprehensive set of interpretable, domain-informed features that capture texture, morphology, and nuclear characteristics.
Finally, a Graph Attention Network (GAT) is applied to this final graph representation to perform slide-level classification tasks.

\subsection{Superpixel segmentation}
The first step of our framework is to convert the raw pixel data of a WSI into a meaningful set of initial regions that respect the boundaries of tissue structures.
More formally, given a WSI $\mathcal{I} \in \mathbb{R}^{H\times W\times 3}$, we identify tissue foreground $\mathcal{T} \subset \mathcal{I}$ by applying Otsu's thresholding on the HSV saturation channel and morphological cleaning, as described in~\cite{lu2021clam}.
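As a brief reminder of the thresholding step (a sketch of standard Otsu thresholding, not the exact implementation of~\cite{lu2021clam}; the symbols $\mathrm{Sat}$, $t^{*}$, $\omega_k$, and $\mu_k$ are introduced here only for illustration), the threshold on the saturation channel maximizes the between-class variance of the two resulting pixel classes:
\[
t^{*} = \arg\max_{t}\; \omega_0(t)\,\omega_1(t)\,\bigl(\mu_0(t) - \mu_1(t)\bigr)^{2},
\]
where $\omega_0(t)$ and $\omega_1(t)$ are the fractions of pixels with saturation at most $t$ and above $t$, and $\mu_0(t)$, $\mu_1(t)$ are the corresponding mean saturation values. Assuming tissue is more saturated than the background, the foreground before morphological cleaning consists of the pixels $p$ with $\mathrm{Sat}(p) > t^{*}$.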

With the tissue area $\mathcal{T}$ identified, we then partition it into an initial set of small, perceptually meaningful regions. Instead of using a rigid grid of square patches, we segment the tissue area into superpixels, $\mathcal{S} = \{s_1, \ldots, s_K\}$, using Simple Linear Iterative Clustering (SLIC)~\cite{radhakrishna2012SLIC} on a low-resolution version of the WSI ($0.625\times$ magnification).
SLIC is a variant of $k$-means clustering that groups spatially close pixels with similar colors, creating small, irregularly shaped regions that naturally adhere to local tissue boundaries, such as gland edges or the interface between tumor and stroma (see \figureref{fig:slic}).
The number of superpixels $K$ is chosen to target an average superpixel size of $300\times 300$ pixels at a magnification level of $32\times$.
This oversegmentation forms the basis of our initial region-adjacency (RA) graph $\mathcal{G}_0 = (\mathcal{V}_0, \mathcal{E}_0)$, where each node $v_i \in \mathcal{V}_0$ corresponds to a superpixel $s_i$, and an edge $(v_i, v_j) \in \mathcal{E}_0$ exists if superpixels $s_i$ and $s_j$ are spatially adjacent.
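For reference, the standard SLIC formulation~\cite{radhakrishna2012SLIC} assigns each pixel to the cluster center that minimizes a combined color and spatial distance. In the sketch below, $d_c$ is the color distance (CIELAB in the original formulation), $d_s$ the Euclidean spatial distance, $N$ the number of pixels, and $m$ the compactness parameter; these symbols, and in particular the value of $m$, are not specified in this section and are stated only for illustration:
\[
D = \sqrt{d_c^{2} + \left(\frac{d_s}{S}\right)^{2} m^{2}}, \qquad S = \sqrt{N/K}.
\]
Under the size target stated above, and writing $|\mathcal{T}|$ for the number of tissue pixels at the resolution at which that target is specified, the number of superpixels is approximately
\[
K \approx \frac{|\mathcal{T}|}{300 \cdot 300}.
\]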


\subsection{Adaptive graph coarsening}
The initial graph $\mathcal{G}_0$ is too fine-grained for efficient downstream processing.
To address this, we introduce an adaptive graph coarsening procedure designed to merge semantically similar, adjacent regions into larger super-nodes.
The goal is to produce a more abstract, computationally efficient graph, where nodes can represent larger, homogeneous tissue components.
The \textit{adaptive} nature of this process is key: it preserves high granularity in heterogeneous, complex regions while substantially simplifying large, uniform areas, thus reducing graph size while preserving the most important morphological boundaries.

\noindent\textbf{Region embeddings} The foundation of our coarsening strategy is a semantically rich representation for each graph node.
We generate a region embedding $h_i \in \mathbb{R}^{512}$ for each node $v_i$ using a ResNet-18 feature encoder. The encoder was pretrained on histology patches using contrastive learning, enabling it to capture fine-grained textural and morphological patterns~\cite{zheng2022GraphTransformer}.
Because the ResNet requires rectangular input while superpixels are irregularly shaped, we extract the bounding box of each region from the WSI and zero-mask (black out) all pixels outside the superpixel boundary. The encoder thus receives a rectangular image that contains only the visual content of the region itself.
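In symbols, and using notation introduced here only for illustration ($B_i$ for the bounding box of $s_i$, $f_\theta$ for the pretrained encoder), the masked input and the resulting embedding can be written as
\[
\tilde{\mathcal{I}}_i(p) = \mathbf{1}[p \in s_i]\;\mathcal{I}(p) \quad \text{for } p \in B_i, \qquad h_i = f_\theta\bigl(\tilde{\mathcal{I}}_i\bigr) \in \mathbb{R}^{512}.
\]
Any resizing or normalization applied before the encoder is omitted from this sketch.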



\noindent\textbf{Greedy merging} With these embeddings, we perform an agglomerative merging of nodes.
We compute the cosine similarity $S_{ij}$ for adjacent node pairs $(v_i, v_j)$ in the initial graph $\mathcal{G}_0$.
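Written out for the embeddings $h_i, h_j \in \mathbb{R}^{512}$ defined above, the standard cosine similarity is
\[
S_{ij} = \frac{h_i^{\top} h_j}{\|h_i\|_2\,\|h_j\|_2}, \qquad (v_i, v_j) \in \mathcal{E}_0,
\]
so that $S_{ij} \in [-1, 1]$ and values close to $1$ indicate nearly parallel embeddings.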

The merging is performed greedily, starting with the pair with the highest similarity. Nodes are merged sequentially until the score of the next candidate pair falls below a predefined threshold $\tau$.
The similarity threshold $\tau$ is a hyperparameter that controls the final granularity of the graph.

When two nodes are merged, they are replaced by a single new node whose region is the union of the original nodes, and which inherits the edges of its predecessors.
Crucially, we do not recompute or average the embeddings of newly formed regions during this process: the decision to merge two adjacent regions is determined solely by the similarity of their initial, local embeddings. This allows the algorithm to aggregate larger homogeneous areas that exhibit gradual feature shifts (e.g., due to staining gradients), since only the local similarity of adjacent components matters.
The process terminates once no adjacent pair has a similarity exceeding $\tau$.
The result is a coarsened graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ where $|\mathcal{V}|\leq|\mathcal{V}_0|$.
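One way to read the outcome of this procedure is as the connected components of a thresholded graph. This reading rests on an assumption made here only for illustration, namely that an edge inherited by a merged node keeps the similarity score of the original superpixel pair. With $\mathcal{E}_\tau$ and $C_a$ as auxiliary notation,
\[
\mathcal{E}_\tau = \{(v_i, v_j) \in \mathcal{E}_0 : S_{ij} \geq \tau\},
\]
each node of $\mathcal{G}$ corresponds to one connected component $C_a$ of $(\mathcal{V}_0, \mathcal{E}_\tau)$, and two components are adjacent whenever some original edge crosses between them:
\[
(C_a, C_b) \in \mathcal{E} \iff a \neq b \ \text{and}\ \exists\,(v_i, v_j) \in \mathcal{E}_0 \ \text{with}\ v_i \in C_a,\ v_j \in C_b.
\]
Whether a pair with $S_{ij} = \tau$ is merged is a boundary convention not fixed by the description above.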

This adaptive coarsening ensures that our final graph is a compact yet faithful representation of the tissue's macro-architecture.


\subsection{Interpretable node features}
While the learned region embeddings $h_i$ used for coarsening are powerful, we do not use them as node features since they are inherently black-box and lack direct clinical interpretability.
To create a framework that is explainable by design, we engineer a separate set of domain-informed features that describe each node in the final coarsened graph.

The feature vector $x_i = [ x^{\mathrm{nuc}\top}_i, x^{\mathrm{tex}\top}_i, x^{\mathrm{morph}\top}_i ]^{\top} \in \mathbb{R}^{191}$ is the concatenation of three feature groups, covering nuclear, texture and intensity, and morphological and color characteristics:
\begin{itemize}
    \item \textbf{Nuclear features} $x^{\mathrm{nuc}}_i \in \mathbb{R}^{77}$: Nuclear morphology is a cornerstone of histopathological assessment. We leverage a pretrained HoVerNet model~\cite{graham2019Hovernet}, a widely used deep learning model for simultaneous nuclear instance segmentation and classification. For each region, we apply HoVerNet to detect individual nuclei and classify them into one of six types (see \appendixref{app:hover}). We then compute a rich set of statistics, including count, density, size, and shape characteristics for each nucleus type, providing a quantitative summary of the cellular composition of the region.
    \item \textbf{Texture and intensity features} $x^{\mathrm{tex}}_i \in \mathbb{R}^{93}$: This group quantifies the micro-patterns within the tissue region. It includes features derived from the Gray-Level Co-occurrence Matrix (GLCM), such as contrast, correlation, and energy, which capture the spatial relationships between pixel intensities. It also includes Local Binary Pattern (LBP) features and general intensity statistics as defined in the PyRadiomics library~\cite{vanGriethuysen2017Radiomics}.
    \item \textbf{Morphological and color features} $x^{\mathrm{morph}}_i \in \mathbb{R}^{21}$: This group describes the size of the region and its color properties, the latter as distribution statistics (mean, variance) computed across multiple color spaces (RGB, HSV, LAB). Together, these features characterize the coarse morphology and color appearance of the tissue region.
\end{itemize}


\noindent 
To reduce redundancy in the high-dimensional initial feature vector, we perform correlation-based pruning. We iteratively remove one feature from every pair whose absolute Pearson correlation $|\rho|$ on the training data exceeds a threshold $\xi$, resulting in a more compact and robust feature set. The pruning is performed once at the dataset level rather than per sample, so all graphs share the same feature subset.
A complete list of all features can be found in \appendixref{app:features}.
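For two features $a$ and $b$ with values $x_{n,a}$ and $x_{n,b}$ over $M$ training observations (index $n$, sample means $\bar{x}_a$ and $\bar{x}_b$; this notation is introduced here for illustration, and the observations are assumed to be the nodes of the training graphs), the pruning criterion uses the standard sample Pearson correlation
\[
\rho_{ab} = \frac{\sum_{n=1}^{M} (x_{n,a} - \bar{x}_a)(x_{n,b} - \bar{x}_b)}{\sqrt{\sum_{n=1}^{M} (x_{n,a} - \bar{x}_a)^{2}}\,\sqrt{\sum_{n=1}^{M} (x_{n,b} - \bar{x}_b)^{2}}},
\]
and one feature of the pair is discarded whenever $|\rho_{ab}| > \xi$. Which of the two features is kept is not determined by this criterion alone.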

The set of node features $\mathcal{X} = \{x_i \mid v_i \in \mathcal{V}\}$ extends the previously defined coarsened graph to form the final, compact graph representation $\mathcal{G} = (\mathcal{V}, \mathcal{E}, \mathcal{X})$ of the WSI $\mathcal{I}$.


\subsection{Graph attention network for classification}
We employ a Graph Attention Network v2 (GATv2)~\cite{brody2022gatv2}, which takes our compact representation $\mathcal{G}$ as input and learns the relative importance of neighboring nodes for the final prediction.
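As a sketch of a single attention head in the GATv2 formulation~\cite{brody2022gatv2}, let $z_i^{(\ell)}$ denote the representation of node $v_i$ at layer $\ell$, with $z_i^{(0)} = x_i$. The weight matrices $W_s$ and $W_t$, the attention vector $a$, the nonlinearity $\sigma$, and the neighborhood $\mathcal{N}(i)$ are notation introduced here; the number of heads and layers and the treatment of self-loops are not specified in this section:
\[
e_{ij} = a^{\top}\,\mathrm{LeakyReLU}\bigl(W_s z_i^{(\ell)} + W_t z_j^{(\ell)}\bigr), \qquad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}(i)} \exp(e_{ik})},
\]
\[
z_i^{(\ell+1)} = \sigma\Bigl(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}\, W_t z_j^{(\ell)}\Bigr).
\]
The coefficients $\alpha_{ij}$ play the role of the learned relative importance of each neighbor of $v_i$.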

After passing the graph through a stack of GATv2 layers, we obtain high-level node embeddings that are context-aware, i.e., they incorporate information from the neighboring tissue environment.
To arrive at a single slide-level prediction, a graph readout function is applied.
This function aggregates all node embeddings via mean pooling into a single graph-level feature vector, which is then fed into a standard multilayer perceptron (MLP).
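Writing $z_i^{(L)}$ for the embedding of node $v_i$ after the last GAT layer and $\hat{y}$ for the slide-level output (both symbols introduced here for illustration), the readout and classification step amounts to
\[
z_{\mathcal{G}} = \frac{1}{|\mathcal{V}|} \sum_{v_i \in \mathcal{V}} z_i^{(L)}, \qquad \hat{y} = \mathrm{MLP}\bigl(z_{\mathcal{G}}\bigr).
\]
Because the mean is normalized by $|\mathcal{V}|$, the graph-level vector does not scale with the number of nodes left after coarsening.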




\subsection{Explainability}
\label{sec:explainability}
To provide transparent explanations for the predictions, we employ Integrated Gradients (IG)~\cite{sundararajan2017axiomatic}. IG attributes the model output to the input features by integrating gradients along a straight-line path from a baseline (here, the zero vector) to the actual input. This yields an attribution matrix of the same shape as the input node features $\mathcal{X}$, with one score per node and feature.
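Concretely, following the definition in~\cite{sundararajan2017axiomatic} and writing $F$ for the model output being explained, $X$ for the node feature matrix with entries $X_{ik}$ (node $v_i$, feature $k$), and $A_{ik}$ for the resulting attribution (notation introduced here for illustration, with the graph structure assumed to be held fixed), the zero baseline gives
\[
A_{ik} = X_{ik} \int_{0}^{1} \frac{\partial F(\alpha X)}{\partial X_{ik}}\, d\alpha,
\]
where the partial derivative is evaluated at the scaled input $\alpha X$. In practice the integral is approximated by a finite sum over interpolation steps, and by the completeness property of IG the attributions sum to $F(X) - F(0)$.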
We process these scores in two ways:
\begin{enumerate}
    \item To identify influential regions, we sum the absolute attribution scores across the feature dimension for each node, highlighting the regions with the highest aggregate contribution.
    \item To interpret biological drivers, we rank specific features (e.g., nuclear density, skewness) based on the magnitude of their attribution scores within the most important regions.
\end{enumerate}
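The two aggregation steps can be sketched as follows, writing $A_{ik}$ for the IG attribution of feature $k$ at node $v_i$ and $\mathcal{V}_{\mathrm{top}} \subseteq \mathcal{V}$ for the set of most important nodes. This notation, the size of $\mathcal{V}_{\mathrm{top}}$, and the choice of a mean over nodes in the second expression are assumptions made here for illustration:
\[
r_i = \sum_{k} |A_{ik}|, \qquad q_k = \frac{1}{|\mathcal{V}_{\mathrm{top}}|} \sum_{v_i \in \mathcal{V}_{\mathrm{top}}} |A_{ik}|.
\]
Nodes are ranked by $r_i$ to select $\mathcal{V}_{\mathrm{top}}$, and features are then ranked by $q_k$.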