\clearpage
\setcounter{page}{1}
\setcounter{figure}{0}
\setcounter{table}{1}
\setcounter{section}{0}

\renewcommand{\thesection}{A\arabic{section}}
\renewcommand{\thefigure}{A\arabic{figure}}
\renewcommand{\thetable}{A\arabic{table}}




\section*{Appendix}
\label{sec:appendix}
% \appendix





\section{Proof of Circular Kernel Equivariance}
\label{sec:appendix:proof}

In short, the SRE-Conv kernel achieves rotational equivariance by using a centrally symmetric kernel in which each circular ring around the center corresponds to one trainable parameter. Because positions that are symmetric about the center share a parameter value, the kernel provides local rotation and reflection invariance via the Hadamard product and global equivariance under convolution. Consider a continuous 2D function \( f(x, y) \) and a centrally symmetric convolutional kernel \( h(x, y) \). The rotation \( R(\cdot) \) of their convolution by an angle \( \theta \) is:
\[
R(h*f)(x,y) = R\left(\iint h(u,v)f(x-u, y-v) \, dudv \right) = \iint h(u',v')f(x'-u', y'-v') \, du'dv',
\]
where \( (x', y') = (x\cos\theta - y\sin\theta, x\sin\theta + y\cos\theta) \), and similarly for \( (u', v') \).
Because rotation is a linear map, the Jacobian determinant of the change of variables \( (u, v) \mapsto (u', v') \) is:
\[
\frac{\partial u'}{\partial u}\frac{\partial v'}{\partial v} - \frac{\partial v'}{\partial u}\frac{\partial u'}{\partial v} = \cos^2\theta + \sin^2\theta = 1.
\]
Thus, we have \( du'dv' = dudv \), and
\[
R(h*f)(x, y) = \iint R(h(u,v)) R(f(x-u, y-v)) \, dudv = R(h) * R(f) = h * R(f),
\]
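where the last equality holds because the symmetric kernel is unchanged by rotation, \( R(h) = h \). This proves rotation equivariance.
Together with the kernel's translation equivariance, this ensures that the output is equivariant to both global and sub-region rotations and reflections.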
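
The symmetry step \( R(h) = h \) used above can be made explicit. As a sketch, assume that the kernel depends only on the distance from its center, so that \( h(u, v) = g\left(\sqrt{u^2 + v^2}\right) \) for some radial profile \( g \) (the symbol \( g \) is introduced here for illustration only). Then, for any angle \( \theta \),
\[
h(u', v') = g\left(\sqrt{(u\cos\theta - v\sin\theta)^2 + (u\sin\theta + v\cos\theta)^2}\right) = g\left(\sqrt{u^2 + v^2}\right) = h(u, v),
\]
since the cross terms \( \mp 2uv\sin\theta\cos\theta \) cancel and \( \cos^2\theta + \sin^2\theta = 1 \). Under the same assumption, replacing \( (u, v) \) by \( (-u, v) \) or \( (u, -v) \) leaves \( \sqrt{u^2 + v^2} \) unchanged, which gives the corresponding invariance of \( h \) to reflections.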







\section{Detailed Quantitative Evaluation and Statistical Testing Results}
\label{sec:appendix:quantitative}


Figure~\ref{fig:icc} shows the full range of ICC values for SRENet, E2CNN, and ResNet as boxplots. In the intra-subject analysis, the median ICC was 0.91 (IQR: 0.89, 0.96) for SRENet, compared with 0.86 (IQR: 0.86, 0.90) for E2CNN and 0.84 (IQR: 0.82, 0.87) for ResNet. In the inter-subject analysis, the median ICC was 0.91 (IQR: 0.90, 0.92) for SRENet, compared with 0.88 (IQR: 0.86, 0.91) for E2CNN and 0.83 (IQR: 0.82, 0.84) for ResNet.
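
For reference, one common form of the ICC is the one-way random-effects variant, often written ICC(1,1). Whether this is the variant used here is an assumption, since the specific form is not stated in this appendix. With \( k \) repeated measurements per target, between-target mean square \( MS_B \), and within-target mean square \( MS_W \) (all symbols introduced here for illustration),
\[
\mathrm{ICC}(1,1) = \frac{MS_B - MS_W}{MS_B + (k - 1)\, MS_W}.
\]
Values close to 1 indicate that most of the variance lies between targets rather than across repeated measurements of the same target.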

\input{figures/fig_results_icc}



\section{Ablation Study Quantitative Results}
\label{sec:appendix:ablation}

Table~\ref{tab:clustering_analysis} reports intra- and inter-subject performance for K-means clustering with $K = 2, 3, 4$ and for Gaussian mixture clustering, using SRENet, E2CNN, and ResNet. SRENet outperforms E2CNN and ResNet across all clustering methods. Performance metrics improve with fewer clusters, likely due to reduced class complexity and lower misclassification rates, although too few clusters might limit the ability to distinguish different tissue types. SRENet thus shows robust feature extraction and classification capabilities, while optimal outcomes require balancing the number of clusters.

\input{figures/tab_results_clustering}







\section{Comparison to Pathologist Segmentation}
\label{sec:appendix:pathologist_eval}


Comparing the unsupervised segmentation results to the ground truth is challenging when no mapping exists between the pathology labels (Gleason Grade categories) and the cluster labels provided by K-means. An alternative way to evaluate the quality of the unsupervised feature embeddings involves the following process:
\begin{inparaenum}[(1)]
    \item define an embedding space (using principal component analysis retaining 99\% of the cumulative variance) from a small subset of the unsupervised features (100 samples from each subject) drawn from all subjects in the inter-subject testing cohort;
    \item map the ground-truth pathologist labels from this subset onto each point in the embedding space;
    \item train a supervised classifier ($k$-nearest neighbors with $k=3$) within the embedding space;
    \item project the feature vectors from all image pixel locations (masked by the ground-truth mask for clarity) into the embedding space; and
    \item classify the projected features from each image using the trained classifier to segment the image.
\end{inparaenum}
This approach evaluates how well the unsupervised feature embeddings map to the ground-truth pathology.
We evaluated the performance of this mapping using the Dice similarity coefficient.
We plot the distribution of Dice values for each method in Fig.~\ref{fig:dice_boxplots} and show the projected pathologist labels on example images in Fig.~\ref{fig:dice_example}.
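
For reference, the Dice similarity coefficient between a predicted region \( A \) and a ground-truth region \( B \) (symbols introduced here for illustration) is
\[
\mathrm{Dice}(A, B) = \frac{2\,|A \cap B|}{|A| + |B|},
\]
which ranges from 0 (no overlap) to 1 (perfect overlap). As a purely hypothetical worked example, if the prediction contains 120 pixels, the ground truth contains 100 pixels, and 90 pixels are shared, then \( \mathrm{Dice} = 2 \cdot 90 / (120 + 100) = 180/220 \approx 0.82 \). How the coefficient is aggregated across classes and images is not specified here, so any per-class averaging would be an additional assumption.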

\input{figures/fig_dice_boxplots}
\input{figures/fig_results_dice_example}





\section{Robust Equivariant Feature Embeddings}
\label{sec:appendix:embeddings}

We used models trained on the NCT-CRC~\cite{Kather2019-xy} dataset to showcase the ability of SRENet to learn stable imaging feature embeddings.
We removed the final classification layers of SRENet, E2CNN~\cite{Weiler2019-yf}, and ResNet~\cite{He2016-de} and performed spatial average pooling on the final feature maps to produce a single vector representation for each image.
We then extracted feature embeddings from each model for rotated test set images and applied t-distributed Stochastic Neighbor Embedding (t-SNE)~\cite{Van_der_Maaten2008-xh} for dimensionality reduction to visually assess these embeddings.
The embeddings from SRENet remained stable and well separated across rotations, unlike those from ResNet, which shifted considerably and mixed across clusters.
SRENet's clusters were also more stable than those of the alternative state-of-the-art equivariant model, E2CNN.
These results highlight the robustness of SRENet in maintaining stable feature embeddings, which is crucial for consistent imaging representations.





\input{figures/fig_feature_embeddings}




\section{Model Pre-training}
\label{sec:appendix:pretraining}

We pre-train each model on the NCT-CRC~\cite{Kather2019-xy} dataset. 
NCT-CRC is a colorectal cancer dataset that contains 100,000 training images for 9 different classes and 7,180 test images.
We train each model for 50 epochs using the SGD optimizer with a cosine annealing scheduler, a learning rate of $2\times10^{-2}$, and cross-entropy loss.
All experiments were run on one NVIDIA A5000 GPU using an image size of $(224, 224)$ and a batch size of $24$.
As is standard practice for equivariant feature learning~\cite{Worrall2017-mv, Weiler2019-yf, Cohen2016-st}, no geometric data augmentation is applied during training to demonstrate the full capabilities of equivariant learning without introducing confounding effects. 
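
As a sketch of the learning rate schedule described above, a standard cosine annealing rule sets the learning rate at epoch \( t \) to
\[
\eta_t = \eta_{\min} + \frac{1}{2}\left(\eta_0 - \eta_{\min}\right)\left(1 + \cos\left(\frac{\pi t}{T}\right)\right),
\]
with \( \eta_0 = 2\times10^{-2} \) and \( T = 50 \) taken from the setup above. The minimum rate \( \eta_{\min} = 0 \), per-epoch updates, and the absence of warm restarts are assumptions made here for illustration and are not stated in the text. Under these assumptions, the schedule starts at \( \eta_0 = 2\times10^{-2} \), reaches \( \eta_{25} = 1\times10^{-2} \) at the midpoint (where the cosine term is zero), and decays to \( \eta_{50} = 0 \) at the end of training.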



We evaluate model performance by computing classification accuracy on:
\begin{inparaenum}[(1)]
    \item the original test set without rotation;  
    \item the rotated test set (rotated in 10{\textdegree} increments); and
    \item the reflected test set (horizontal and vertical flips).
\end{inparaenum}
We report classification results in Table~\ref{tab:nct_comparison}.
SRENet outperforms E2CNN and ResNet on all test sets.
Additionally, we compare the pre-training classification performance of the CNN models to that of a vision transformer (ViT)~\cite{Dosovitskiy2020-lr}.
ViT underperforms all other approaches, most likely because it requires large amounts of training data.
ViT also demonstrates the greatest sensitivity to rotated data, indicating its limitations compared to equivariant approaches like SRENet and E2CNN.
Based on ViT's relatively poor performance on this pre-training task, we excluded ViT as a baseline comparison method for our unsupervised learning segmentation task (Sec.~\ref{sec:results:baselines}).



\input{figures/tab_results_nct_class}







