\appendix
\section{Dataset}
\label{Appendix: stard, clinical variables }



\begin{figure}[htbp]
    \centering

    \includegraphics[width=0.5\textwidth]{figures/Flowchart.pdf}
    \caption{Inclusion and exclusion criteria for prostate MRI exams used in this study.}
    \label{fig: stard diagram}
\end{figure}
\subsection{Clinical variables}
\begin{table}[htbp]
\centering
\begin{tabular*}{\textwidth}{@{\extracolsep{\fill}} l >{\raggedright\arraybackslash}p{7cm}}
\hline
\makecell[l]{\textbf{Feature}} &
\makecell[l]{\textbf{Pretraining unlabeled data (STOH)}} \\
\hline
Sites & 1 \\
Patients & 2431 \\
\makecell[l]{Median age, years} & 67 (61--72) \\
\makecell[l]{Median PSA, ng/mL} & 7 (5--11) \\
\makecell[l]{Median prostate volume, mL} & 47 (33--67) \\
\makecell[l]{Field strength, Tesla} & 3 \\
Cases & 2431 \\
\makecell[l]{Positive MRI lesions} & 1726 \\
PI-RADS 3 & 441 (25\%) \\
PI-RADS 4 & 589 (34\%) \\
PI-RADS 5 & 696 (40\%) \\
\makecell[l]{Patient-level clinically\\ significant PCa (GG $\geq$ 2)} & 799 \\
\hline
\end{tabular*}
\caption{Clinical variables for the unlabeled St. Olavs Hospital (STOH) pretraining cohort. PSA: prostate-specific antigen; PI-RADS: Prostate Imaging-Reporting and Data System; PCa: prostate cancer; GG: Gleason Grade.}
\label{tab:clinical_info_stoh_vertical}
\end{table}


\FloatBarrier
\section{Pretraining}
\label{Appendix: pretraining}
\subsection{Volume Fusion}


Given two sub-volumes \( I_b \) (background) and \( I_f \) (foreground), a fused volume \( X \in \mathbb{R}^{D \times H \times W} \) is generated by element-wise blending with a voxel-wise fusion map \( \alpha \), whose entries take values in a discrete set of fusion coefficients \( \mathcal{V} \):
\[
X = \alpha I_f + (1 - \alpha) I_b.
\]
The corresponding label map \( Y \) is derived from \( \alpha \), and the model is trained to predict voxel-wise fusion classes using the following segmentation loss:
\[
\mathcal{L}_{\text{sup}} = \frac{1}{2}(\mathcal{L}_{\text{dice}} + \mathcal{L}_{\text{ce}}).
\]
For additional details, we refer the reader to the original reference.
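
As a single-voxel illustration of the fusion rule above, assume (hypothetically; neither the set nor the numbers are taken from this study) that \( \mathcal{V} = \{0, \tfrac{1}{K}, \dots, 1\} \) with \( K = 4 \), and that the fusion class is the index \( Y = K\alpha \). For a voxel with \( \alpha = 0.25 \), \( I_f = 0.8 \), and \( I_b = 0.4 \):
\[
X = 0.25 \cdot 0.8 + 0.75 \cdot 0.4 = 0.2 + 0.3 = 0.5, \qquad Y = 4 \cdot 0.25 = 1.
\]
Under this assumption, the network observes only the blended intensity \( 0.5 \) and must recover the class \( 1 \), that is, how strongly the foreground was mixed in at that voxel.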
\subsection{Model Genesis}
Model Genesis trains the network to restore sub-volumes that have been distorted by four augmentations: (1) \textit{Non-linear intensity transformations}, which monotonically distort voxel intensities to encourage the model to capture tissue appearance and contrast; (2) \textit{Local pixel shuffling}, which permutes voxel positions within a local window, helping the model learn about textures and boundaries; (3) \textit{Inner cutout} and (4) \textit{Outer cutout}, which both mask parts of a sub-volume using arbitrarily shaped windows. In inner cutout, the central region is masked while the outer area is retained; in outer cutout, the outer area is masked while the central region is retained. These augmentations guide the model to interpolate or extrapolate missing information, promoting awareness of local and global anatomical continuity and geometry.


Consider a set of sub-volumes \( \mathcal{X} = \{x_1, x_2, \dots, x_n\} \) extracted from raw 3D scans. These sub-volumes are transformed using a distortion function \( f(\cdot) \), producing a set \( \widetilde{\mathcal{X}} = f(\mathcal{X}) = \{\tilde{x}_1, \tilde{x}_2, \dots, \tilde{x}_n\} \). The model is trained to reconstruct the original sub-volumes from the distorted ones, which is formulated as learning a function \( g(\cdot) \) such that:
\[
g\left(\widetilde{\mathcal{X}}\right) = \mathcal{X} = f^{-1}\left(\widetilde{\mathcal{X}}\right).
\]
The network minimizes the mean squared error (MSE) between the predicted output \( \hat{x}_i = g(\tilde{x}_i) \) and the original input \( x_i \):
\[
\mathcal{L}_{\text{MG}} = \frac{1}{n} \sum_{i=1}^{n} \| \hat{x}_i - x_i \|_2^2.
\]
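
As a small numerical illustration of \( \mathcal{L}_{\text{MG}} \) (the values are hypothetical and chosen only for easy arithmetic), take \( n = 2 \) sub-volumes of two voxels each, with \( x_1 = (0.2, 0.6) \), \( \hat{x}_1 = (0.1, 0.6) \), \( x_2 = (0.5, 0.9) \), and \( \hat{x}_2 = (0.7, 0.8) \):
\[
\mathcal{L}_{\text{MG}} = \frac{1}{2}\left[ (0.1 - 0.2)^2 + (0.6 - 0.6)^2 + (0.7 - 0.5)^2 + (0.8 - 0.9)^2 \right] = \frac{1}{2}\,(0.01 + 0 + 0.04 + 0.01) = 0.03.
\]
As written, the loss sums the squared error over the voxels of each sub-volume and averages only over the \( n \) sub-volumes.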
\FloatBarrier
\subsection{Masked Autoencoders}

For a 3D U-Mamba model~\cite{ma2024u}, let \( X \in \mathbb{R}^{B \times D \times H \times W \times C} \) denote the raw input (and target) volume, \( \hat{X} \in \mathbb{R}^{B \times D \times H \times W \times C} \) the predicted output volume, and \( M \) a binary mask of the same shape, where \( M = 1 \) indicates visible (unmasked) voxels and \( M = 0 \) indicates masked voxels. Here \( B \) is the batch size, \( D \times H \times W \) the spatial size, and \( C \) the number of channels. The reconstruction loss, a mean squared error restricted to the masked voxels, is defined as:

\[
\mathcal{L}_{\text{MAE}} = \frac{\sum (1 - M) \cdot (\hat{X} - X)^2}{\sum (1 - M)}.
\]

Here, the numerator computes the squared reconstruction error only over masked voxels, and the denominator normalizes by the number of masked elements. This encourages the model to infer the missing regions solely from the visible context.
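
To make the role of the mask concrete, consider a hypothetical four-voxel example (values chosen only for illustration) with \( M = (1, 0, 0, 1) \), \( X = (0.2, 0.5, 0.7, 0.1) \), and \( \hat{X} = (0.9, 0.4, 0.9, 0.1) \). Only the second and third voxels are masked, so
\[
\mathcal{L}_{\text{MAE}} = \frac{(0.4 - 0.5)^2 + (0.9 - 0.7)^2}{2} = \frac{0.01 + 0.04}{2} = 0.025.
\]
The large error on the first voxel, \( (0.9 - 0.2)^2 = 0.49 \), does not contribute because that voxel is visible.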
\FloatBarrier
\section{Prostate and zonal segmentation results}
\label{Appendix: Zonal Segmentation}
An additional advantage of our model is that the auxiliary task also yields prostate and zonal segmentations. This can be clinically useful, as accurate segmentation of the prostate and its zones is important for biopsy guidance. In clinical practice, transrectal ultrasound (TRUS)-guided biopsies are often performed using MRI--ultrasound fusion, in which T$_2$-weighted MRI is fused with ultrasound to enable MRI-targeted biopsies and improve lesion localization. Our model achieved Dice similarity coefficients (DSC) of 0.76 for the peripheral zone (PZ) and 0.87 for the transition zone (TZ), closely matching the reported inter-reader variability of $\mathrm{DSC}_{\mathrm{PZ}} = 0.75$ and $\mathrm{DSC}_{\mathrm{TZ}} = 0.87$.
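
For reference, the DSC values are assumed here to follow the standard Dice definition. For a predicted mask \( P \) and a reference mask \( G \) (symbols introduced only for this illustration):
\[
\mathrm{DSC}(P, G) = \frac{2\,|P \cap G|}{|P| + |G|}.
\]
For example, a hypothetical prediction of 100 voxels and a reference of 120 voxels that share 90 voxels give \( \mathrm{DSC} = 180 / 220 \approx 0.82 \).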
 

\begin{figure}[htbp]
    \centering
    \includegraphics[width=0.7\textwidth]{figures/dsc_zones_dlr.pdf}
    \caption{Prostate zonal segmentation results for the peripheral zone (PZ) and transition zone (TZ) on in-house St. Olavs Hospital cases ($N=200$).}
    \label{fig:dsc_score}
\end{figure}



\FloatBarrier
\section{Details on Spark3D}
\label{Appendix: Spark3D}
\begin{table}[t]
\centering
\small
\renewcommand{\arraystretch}{0.85}
\begin{tabular}{l l}
\hline
\textbf{Hyperparameter} & \textbf{Value} \\
\hline
features per stage & [32, 64, 128, 256, 320, 320] \\
norm op & torch.nn.InstanceNorm3d \\
nonlin & torch.nn.LeakyReLU \\
nblocks per stage & [1, 3, 4, 6, 6, 6] \\
conv op & torch.nn.Conv3d \\
nconv per stage decoder & [2, 2, 2, 2, 2] \\
kernel sizes & [[3,3,3], [3,3,3], [3,3,3], [3,3,3], [3,3,3], [3,3,3]] \\
nstages & 6 \\
strides & [[1,1,1], [1,2,2], [1,2,2], [2,2,2], [2,2,2], [1,2,2]] \\
network class name & ResidualEncoderUNet \\
\hline
\end{tabular}
\caption{Residual Encoder UNet (nnU-Net v2) configuration adapted for anisotropic biparametric MRI (bpMRI).}
\end{table}
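
The strides listed in the Residual Encoder UNet configuration determine how strongly each axis is downsampled along the encoder. As a sketch, assuming the usual nnU-Net axis order (through-plane axis first, followed by the two in-plane axes), the cumulative factors at the deepest stage are the per-axis products of the six stride entries:
\[
s_{\text{through-plane}} = 1 \cdot 1 \cdot 1 \cdot 2 \cdot 2 \cdot 1 = 4, \qquad
s_{\text{in-plane}} = 1 \cdot 2 \cdot 2 \cdot 2 \cdot 2 \cdot 2 = 32.
\]
For a hypothetical input patch of \( 16 \times 256 \times 256 \) voxels (not a value taken from this study), the deepest feature map would therefore have size \( 4 \times 8 \times 8 \). The in-plane axes are reduced eight times more than the through-plane axis, which is consistent with the anisotropic voxel spacing of bpMRI.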
\FloatBarrier
\section{Details on Swin-UNETR}
\label{Appendix: Swin-UNETR}
\begin{table}[h!]
\centering
\begin{tabular}{l l}
\hline
\textbf{Hyperparameter} & \textbf{Value} \\
\hline
network class name & SwinUNETR (MONAI) \\
in\_channels & 3 \\
out\_channels & 2 \\
img\_size & \small $[256, 256, 32]$ \\
feature\_size & 48 \\
use\_v2 & True \\
depths (default) & \small $(2, 2, 2, 2)$ \\
num\_heads (default) & \small $(3, 6, 12, 24)$ \\
patch\_size (default) & \small $(2, 2, 2)$ \\
window\_size (default) & \small $(7, 7, 7)$ \\
norm\_name (default) & instance \\
spatial\_dims (default) & 3 \\
\hline
\end{tabular}
\caption{Swin-UNETR hyperparameters used in this study. Entries marked ``default'' were left at their default values.}
\label{Appendix:swinunetr_hparams}
\end{table}
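
As a consistency check on this configuration, the feature-map sizes can be worked out by hand. The sketch below assumes the usual Swin-UNETR layout, which is not stated in the table: a patch embedding with stride equal to patch\_size, followed by four stages that each end in a patch-merging step halving the resolution and doubling the channel count. For img\_size \( = [256, 256, 32] \) and feature\_size \( = 48 \) this gives:
\[
\begin{array}{lcc}
\text{Level} & \text{Spatial size} & \text{Channels} \\
\text{patch embedding} & 128 \times 128 \times 16 & 48 \\
\text{stage 1} & 64 \times 64 \times 8 & 96 \\
\text{stage 2} & 32 \times 32 \times 4 & 192 \\
\text{stage 3} & 16 \times 16 \times 2 & 384 \\
\text{stage 4} & 8 \times 8 \times 1 & 768
\end{array}
\]
Under this assumption the total reduction is \( 2^5 = 32 \) per axis, so every entry of img\_size must be divisible by 32, and the 32 through-plane slices used here are the smallest admissible depth.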






