\section{Experiments and Results}
\label{sec:results}

\subsection{Dataset}
\label{sec:results:data}


We evaluate using 50 H$\&$E-stained prostate cancer tissue micro-array (TMA) histopathology images from the public Gleason 2019 Challenge dataset~\cite{Nir2018_TMAdata,Karimi2020_TMAdata_6path}. 
Each image, acquired at 40$\times$ magnification ($\sim$5120$\times$$\sim$5120 pixels), is annotated by at least one expert pathologist, who segments the image into benign and Gleason Grade 3, 4, and 5 categories. The dataset is partitioned into equal halves (25 images each) for training and testing.
For model pre-training, we used the histopathologic colon cancer dataset NCT-CRC~\cite{Kather2019-xy}, which contains 100,000 non-overlapping image patches from 9 different tissue classes of H$\&$E-stained tissue slides.



\subsection{Model Implementation and Baseline Comparisons}
\label{sec:results:baselines}


We used the conventional ResNet18 (ResNet)~\cite{He2016-de} as the non-equivariant CNN baseline. We further compared SRENet with the state-of-the-art rotation-equivariant baseline E2CNN~\cite{Weiler2019-yf}, using a WideResNet-16~\cite{Zagoruyko2016-bh} backbone.
For pre-training, all models were trained for 50 epochs on the NCT-CRC training dataset with an image size of (224, 224) and a batch size of 24, using cross-entropy loss and SGD optimization with a learning rate of $2 \times 10^{-2}$ and a cosine annealing scheduler.
As is standard practice for equivariant feature learning~\cite{Worrall2017-mv, Weiler2019-yf, Cohen2016-st}, no geometric data augmentation is applied during training to avoid confounding effects.
Detailed classification results of this pre-training task can be found in the Appendix (Sec.~\ref{sec:appendix:pretraining}).
For TMA feature extraction, we used an image size of (512, 512), resulting in a feature map size of (64, 64) after extraction at the $L=4$ layer. The feature maps $\mathcal{F}_L$ were resized to (128, 128) and flattened for K-means cluster fitting with $K=3$.
All experiments were run on an NVIDIA A5000 GPU.
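
As a quick sanity check on the TMA feature sizes quoted above, the arithmetic can be written out explicitly. This assumes that the feature maps are flattened over their spatial dimensions only; the channel count $C$ is a symbol introduced here for illustration and is not stated in this section.
\[
\frac{512}{64} = 8, \qquad \frac{512}{128} = 4, \qquad 128 \times 128 = 16{,}384 .
\]
Under this assumption, the $L=4$ layer downsamples the input by a factor of 8 per side, each resized feature location corresponds to a $4 \times 4$ block of input pixels, and each image contributes 16,384 feature vectors of dimension $C$ to the clustering step.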




\subsection{Evaluation Metrics}
To evaluate the robustness of our unsupervised segmentation approach, we calculate the following metrics:
\begin{inparaenum}[(i)]
    \item the intra-class correlation coefficient (ICC), which measures the reliability or consistency of measurements within the same group~\cite{mcgraw1996_icc};
    \item Cohen's Kappa (Kappa), which quantifies measurement agreement for categorical data while accounting for the agreement occurring by chance~\cite{Cohen_kappa}; and
    \item the Dice similarity coefficient (Dice).
\end{inparaenum}
We used these metrics to evaluate the consistency of K-means cluster label images across 12 rotated versions of each input image, generated at 30-degree increments.
For each input image, we compute each metric across all post-rotation images, in which each pixel is assigned a cluster label, to quantify how consistently each pixel retains its cluster label after rotation. We test for significant differences between models ($\alpha = 0.05$) using Wilcoxon rank-sum tests on the resulting metric values.
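
For reference, the standard pairwise definitions of Kappa and Dice can be written as follows. The notation is introduced here for illustration: $N$ is the number of pixels, $K$ the number of clusters, and $A_k$, $B_k$ the sets of pixels assigned to cluster $k$ in the two label images being compared. How the pairwise values are aggregated over the 12 rotations and over clusters (for example, by averaging) is an assumption not specified in this section, and the ICC is omitted because it has several variants~\cite{mcgraw1996_icc}.
\[
p_o = \frac{1}{N} \sum_{k=1}^{K} |A_k \cap B_k|, \qquad
p_e = \sum_{k=1}^{K} \frac{|A_k|}{N} \cdot \frac{|B_k|}{N}, \qquad
\kappa = \frac{p_o - p_e}{1 - p_e},
\]
\[
\mathrm{Dice}_k = \frac{2\,|A_k \cap B_k|}{|A_k| + |B_k|}.
\]
Here $p_o$ is the observed fraction of pixels on which the two label images agree and $p_e$ is the agreement expected by chance from the marginal label frequencies, so $\kappa = 1$ indicates perfect agreement and $\kappa = 0$ indicates agreement no better than chance.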





\subsection{Intra-Subject Rotation Analysis}
For each testing subject, features were first extracted from the TMA image in its original orientation (0 degrees). From the set of valid masked features, a random subsample of $n=2000$ feature vectors was drawn to fit the K-means model. The original TMA image was then rotated at 30-degree intervals, yielding 12 rotations (0 to 330 degrees). Features were extracted from each rotated image and clustered using the K-means model fitted on the 0-degree image, resulting in 12 segmentation images per subject.
To facilitate metric calculations, segmentations were rotated back to their original orientation, providing 12 post-rotation images for each subject.
SRENet exhibited higher intra-subject ICC, Kappa, and Dice than both E2CNN and ResNet ($p<0.05$; Tab.~\ref{tab:agreement_metrics}), indicating superior label consistency under rotation.
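
The fit-once, reuse-across-rotations procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation used in this work: it uses a plain Lloyd's K-means with farthest-first initialisation, toy 2-D points in place of real feature vectors, and small random noise as a placeholder for re-extracting features from a rotated image and rotating the labels back. All function names (\texttt{fit\_kmeans}, \texttt{nearest}, \texttt{cohen\_kappa}) are hypothetical, and comparing each rotation against the 0-degree labels is one possible way of measuring agreement.

```python
import random


def sq_dist(x, y):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def nearest(x, centroids):
    """Index of the centroid closest to feature vector x."""
    return min(range(len(centroids)), key=lambda j: sq_dist(x, centroids[j]))


def fit_kmeans(samples, k, n_iter=20):
    """Lloyd's K-means with farthest-first initialisation (an assumption)."""
    centroids = [samples[0]]
    while len(centroids) < k:
        centroids.append(
            max(samples, key=lambda s: min(sq_dist(s, c) for c in centroids))
        )
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for s in samples:
            groups[nearest(s, centroids)].append(s)
        # Recompute each centroid as the mean of its members; keep it if empty.
        centroids = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids


def cohen_kappa(a, b, k):
    """Cohen's Kappa between two label sequences with labels 0..k-1."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    p_e = sum((a.count(c) / n) * (b.count(c) / n) for c in range(k))
    return 1.0 if p_e == 1.0 else (p_o - p_e) / (1.0 - p_e)


# Toy stand-in for the 0-degree feature vectors: three separated 2-D blobs.
rng = random.Random(0)
blob_centres = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
reference = [(cx + rng.gauss(0, 0.3), cy + rng.gauss(0, 0.3))
             for cx, cy in blob_centres for _ in range(200)]

# Fit on a random subsample of the reference features (the text uses n = 2000).
centroids = fit_kmeans(rng.sample(reference, 150), k=3)
ref_labels = [nearest(f, centroids) for f in reference]

kappas = []
for angle in range(0, 360, 30):  # 12 rotations: 0, 30, ..., 330 degrees
    # Placeholder for "rotate image, extract features, rotate labels back":
    # here the reference features are merely perturbed with small noise.
    rotated = [(x + rng.gauss(0, 0.05), y + rng.gauss(0, 0.05))
               for x, y in reference]
    rot_labels = [nearest(f, centroids) for f in rotated]
    kappas.append(cohen_kappa(ref_labels, rot_labels, k=3))

print(min(kappas))
```

The key design point mirrored here is that the centroids are fitted only once, on the 0-degree subsample, and are then reused unchanged to label every rotated version, so any drop in agreement reflects a change in the features rather than a change in the clustering model.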





\subsection{Inter-Subject Rotation Analysis}

For inter-subject analysis, features were extracted from TMA images of 25 training subjects at their original orientation (0 degrees). From each subject, $n=500$ feature samples were randomly selected, yielding a total of 12,500 features across all training subjects. This aggregated feature embedding was used to train the K-means clustering model.
For the 25 testing subjects, TMA images were rotated at 30-degree intervals from 0 to 330 degrees, generating 12 rotated images per subject. Features from these rotated images were clustered with the trained K-means model, resulting in 12 cluster-labeled images per subject. These cluster-labeled images were then rotated back to their original orientation for metric calculation.
In the inter-subject analysis, SRENet again exhibited higher ICC, Kappa, and Dice than both E2CNN and ResNet ($p<0.05$; Tab.~\ref{tab:agreement_metrics}).
Both intra- and inter-subject analyses underscore the superior performance of SRENet in maintaining unsupervised cluster-label consistency against rotations when compared to E2CNN and ResNet.
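
The pooling step of the inter-subject analysis (a fixed number of randomly selected features per training subject, concatenated into one training set) can be sketched as follows. This is a minimal illustration rather than the actual pipeline: the function name \texttt{pool\_training\_features}, the dictionary layout, and the toy 2-D features are all assumptions made for the example.

```python
import random


def pool_training_features(features_by_subject, n_per_subject=500, seed=0):
    """Draw n_per_subject feature vectors from every subject and pool them.

    features_by_subject maps a (hypothetical) subject ID to a list of feature
    vectors; each list must hold at least n_per_subject entries.
    """
    rng = random.Random(seed)
    pooled = []
    for subject_id in sorted(features_by_subject):
        # Sampling without replacement within each subject.
        pooled.extend(rng.sample(features_by_subject[subject_id], n_per_subject))
    return pooled


# Toy data: 25 training subjects, each with 1024 two-dimensional features.
rng = random.Random(42)
toy_features = {
    f"subject_{i:02d}": [(rng.random(), rng.random()) for _ in range(1024)]
    for i in range(25)
}

pooled = pool_training_features(toy_features)
print(len(pooled))  # 25 subjects x 500 samples = 12500
```

Sampling the same number of features from every subject keeps any single image from dominating the pooled set that the clustering model is fitted on.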

\input{figures/fig_results_intraintersubject}

\input{figures/tab_results_intra_inter}

\input{figures/fig_results_path}


Comparison of the unsupervised segmentation results to ground-truth is challenging when no mapping exists between the pathology labels (Gleason Grade categories) and the cluster labels provided by K-means.
An alternative is to evaluate the quality of the unsupervised feature embeddings by creating an embedding space from a subset of features, mapping pathologist labels to it, and training a classifier within this space. All image pixel features are then projected into this space and the classifier is used to segment the images, which assesses how well the unsupervised embeddings align with the pathological ground truth.
Details of this procedure can be found in the Appendix (Sec.~\ref{sec:appendix:pathologist_eval}).
We evaluated the performance of this mapping using the Dice similarity coefficient.
SRENet achieved higher mean$\pm$SD Dice values (0.91$\pm$0.07) than either E2CNN (0.82$\pm$0.12) or ResNet (0.83$\pm$0.12) (Appendix Fig.~\ref{fig:dice_boxplots}); example images are shown in Appendix Fig.~\ref{fig:dice_example}.




\subsection{Qualitative Evaluation}

For unsupervised feature learning, cluster segmentations from SRENet, E2CNN, and ResNet were visualized for the intra-subject and inter-subject analyses~(Fig.~\ref{fig:intraintersubject}). SRENet produces consistent segmentations across rotations, while the conventional CNN (ResNet) exhibits changes that hinder meaningful cluster visualization. Although E2CNN preserves segmentation clusters at certain rotation angles, substantial variations occur between these angles.
We further compare our clustering results to pathologist segmentations and observe a promising correspondence between pathologist labeling and our unsupervised cluster segmentation method (Fig.~\ref{fig:path}), highlighting the potential of our method in histopathology applications.

To qualitatively assess the equivariant embeddings from pre-training with the NCT-CRC dataset, we visually compare SRENet, E2CNN, and ResNet feature spaces using t-distributed Stochastic Neighbor Embedding (t-SNE)~\cite{Van_der_Maaten2008-xh} in the Appendix (Sec.~\ref{sec:appendix:embeddings}). When images are rotated, the standard ResNet feature embeddings shift substantially within the t-SNE space, mixing labeled clusters, whereas the SRENet feature embeddings remain stable, keeping labeled clusters mostly well-separated.
E2CNN, the state-of-the-art equivariant method, also exhibits noticeable cluster shifts compared to the stable behavior of SRENet.




\subsection{Ablation Studies}

We evaluate K-means clustering with $K = 2, 3, 4$ and Gaussian mixture clustering in the intra- and inter-subject settings using SRENet, E2CNN, and ResNet (see Appendix Sec.~\ref{sec:appendix:ablation}). Our results (Tab.~\ref{tab:clustering_analysis}) consistently show that SRENet outperforms E2CNN and ResNet regardless of the clustering method employed. Both intra-subject and inter-subject analyses show that performance metrics improve as the number of clusters decreases, likely due to reduced class complexity and a lower chance of incorrect class assignment. While fewer clusters lead to better evaluation performance, they may sacrifice the ability to distinguish between different tissue types. SRENet is the best-performing model in the examined clustering scenarios, but the number of clusters should be chosen carefully to balance label consistency against tissue-type discrimination for a given application.
