

\section{Methodology}
The overall framework, illustrated in \autoref{fig:graphical_abstract}, consists of two stages: self-supervised pretraining on a large-scale unlabeled prostate bpMRI dataset, followed by supervised fine-tuning on publicly available labeled data. Both stages build on our state-of-the-art (SOTA) architecture for prostate cancer detection proposed in~\cite{larsen2025prostate}. Specifically, we pretrain the UMamba architecture~\cite{ma2024u} with three different pretext tasks: Volume Fusion~\cite{wang2023mis}, Models Genesis~\cite{zhou2021models}, and Masked Autoencoders (MAE)~\cite{he2022masked}. The learned representations are then fine-tuned for prostate cancer detection using our multi-task, prostate zone-aware model and compared against Swin-UNETR, the vanilla UMamba, and Spark3D, a recent SOTA method in medical self-supervised learning (SSL).
\begin{figure}[h]
\centering
% \hspace*{0.7cm}
\includegraphics[width=1\textwidth]{figures/overall_process.pdf}
\caption{The UMamba model is pretrained separately using three SSL objectives: Models Genesis (MG), Masked Autoencoders (MAE), and Volume Fusion (VF). The output channel configurations are 3$^\dagger$ for MG and MAE, and 5$^{*}$ for VF. The pretrained UMamba model is subsequently fine-tuned using a multi-task objective that jointly predicts prostate cancer and prostate zones (peripheral and transition zones).}
 \label{fig:graphical_abstract}
\end{figure}

\subsection{Dataset}
In this study, we used a total of 5,189 prostate MRI scans for self-supervised pretraining, fine-tuning, and evaluation. The dataset includes scans from an institutional cohort at St. Olavs Hospital (STOH; N=2,431), the PI-CAI public training and development set (N=1,500), the PI-CAI hidden tuning cohort (N=100), the PI-CAI hidden testing cohort (N=1,000), and the external out-of-distribution (OOD) Prostate158 dataset (P158; N=158). Across all datasets, positive cases are defined as histologically confirmed clinically significant prostate cancer (csPCa; ISUP~$\geq$~2), and negative cases are determined from histology (ISUP~$\leq$~1) or MRI (PI-RADS~$\leq$~2) findings with a follow-up period of at least three years, except for the unlabeled in-house data and P158. An overview of the datasets used across all stages is given in \autoref{tab:Dataset}, and the clinical variables associated with the in-house unlabeled cohort are provided in \hyperref[Appendix: stard, clinical variables  ]{Appendix~\ref*{Appendix: stard, clinical variables  }}.

For a detailed description of the inclusion and exclusion criteria, annotation methods, and clinical variables within the PI-CAI dataset, please refer to \cite{saha2024artificial}. Further descriptions of the in-house dataset and a short summary of PI-CAI are provided below.

The \textbf{In-House} dataset comprises prostate MRI exams from St. Olavs Hospital, Trondheim University Hospital (STOH)/NTNU, Norway, collected between June 2014 and July 2024. Inclusion was based on suspicion of prostate cancer (PCa) via elevated PSA, digital rectal exam, repeated biopsy, or active surveillance. For patients with multiple scans, only the first available diagnostic scan at initial suspicion or diagnosis was retained. Scans obtained for in-bore biopsy, staging, or post-treatment follow-up were excluded, as were scans with missing sequences or extraction errors. Of these, 2,431 unlabeled scans were used for self-supervised pretraining. The Regional Committee for Medical and Health Ethics, Mid-Norway, approved the use of the in-house dataset (identifier 2017/576). An additional dataset of 200 cases with expert annotations~\cite{kruger2021multiparametric} of csPCa and prostate zones was excluded because most of these cases (197/200) are included in the PI-CAI test set. These cases were annotated for csPCa and zonal anatomy in ITK-SNAP by a radiology resident with two years of experience, supervised by a senior radiologist with over 10 years of experience. The expert zonal segmentations of these cases were nevertheless used to evaluate the predicted prostate zones. \autoref{fig: stard diagram} in \hyperref[Appendix: stard, clinical variables ]{Appendix~\ref*{Appendix: stard, clinical variables }} further illustrates the selection process.

\begin{table}[ht]
\centering
\renewcommand{\arraystretch}{1.2}
\begin{tabular*}{\textwidth}{@{\extracolsep{\fill}} >{\raggedright\arraybackslash}p{3.7cm} c >{\raggedright\arraybackslash}p{2.5cm} >{\raggedright\arraybackslash}p{3.5cm} >{\raggedright\arraybackslash}p{3.5cm}}
\toprule
\textbf{Dataset} & \textbf{Cases} & \textbf{Type} & \textbf{Annotations} & \textbf{Stages} \\
\midrule
In-House & 2,431 & 3T bpMRI & None & SSL Pretraining \\
PI-CAI Training and Development set & 1,500 & 1.5T, 3T bpMRI & PCa$^1$, Zonal$^2$ & Fine-tuning, Ablation \\
PI-CAI Hidden tuning cohort & 100 & 1.5T, 3T bpMRI & PCa & Testing \\
PI-CAI Hidden testing cohort & 1,000 & 1.5T, 3T bpMRI & PCa & Testing \\
Prostate158 & 158 & 3T bpMRI & PCa & Testing \\
\bottomrule
\end{tabular*}
\caption{Dataset information and usage across stages of the framework. 
$^1$Includes 200 AI-generated PCa labels; 
$^2$All zonal labels are AI-generated.}
\label{tab:Dataset}
\end{table}
The \textbf{Public Training and Development Set} is a subset of the PI-CAI challenge and consists of 1,500 cases, of which 425 are csPCa. Of these csPCa cases, 220 were annotated by human experts, while the remaining annotations were generated using validated AI-based pipelines provided by the PI-CAI challenge organizers, as described in~\cite{bosma2021annotation}. Additionally, prostate zonal annotations were provided; these were generated by an nnU-Net~\cite{isensee2021nnu} model trained on the ProstateX dataset~\cite{yuan2025z}, a subset of the PI-CAI training set.

The \textbf{PI-CAI Hidden Tuning Set} consists of 100 cases, 46 of which are clinically significant PCa. We used this cohort to evaluate all trained models for performance comparison and to facilitate the selection of the best model.

The \textbf{PI-CAI Hidden Test Set} is a large-scale testing cohort comprising 1,000 cases, of which 398 are csPCa. These cases originate from eight sites across four centers in the Netherlands and one center in Norway. The best-performing model, selected on the PI-CAI hidden tuning set, and its non-SSL counterpart were both evaluated on this cohort to assess the effect of pretraining.

The \textbf{Prostate158 (P158) dataset}~\cite{adams2022prostate158} is a publicly available, expert-annotated external cohort comprising 158 biparametric 3T prostate MRI scans acquired using Siemens scanners. Each case includes $T_2$-weighted and ADC sequences with voxel-wise annotations of zonal anatomy and csPCa lesions, defined as PI-RADS~$\geq$~4 and ISUP~$\geq$~1, performed in ITK-SNAP by two radiologists (6 and 8 years of experience). Notably, P158 defines the clinical significance of PCa differently from the PI-CAI hidden test set and is therefore used only to assess model generalizability on an external OOD cohort, rather than as a benchmark dataset.

\subsection{Architecture}
UMamba~\cite{ma2024u}, an efficient adaptation of the Mamba framework~\cite{gu2024mamba}, models long-range dependencies with linear time complexity and serves as an alternative to convolutional and transformer-based networks. Our prior work~\cite{larsen2025prostate} introduced UMamba-MTL, which extends the UMambabot variant with a multi-task single-decoder framework that integrates anatomical zone segmentation of the peripheral zone (PZ) and transition zone (TZ) as an auxiliary task alongside clinically significant prostate cancer (csPCa) detection. UMamba-MTL achieved state-of-the-art csPCa detection on an out-of-distribution in-house dataset (N=200), surpassing CNN and hybrid CNN-transformer models, and showed promising results on the PI-CAI hidden tuning cohort (N=100). Motivated by these outcomes, we use UMamba for pretraining and UMamba-MTL for fine-tuning in this study; architectural details are available in~\cite{larsen2025prostate}.

\subsection{Pretraining}
\label{sec:pretraining}
Let $D_u = \{X_u\}$ denote a large set of unlabeled 3D MRI volumes, where each volume \( X \in \mathbb{R}^{C \times D \times H \times W} \), with \( C \) denoting the number of imaging channels (T$_2$w, ADC, and high b-value (HBV) diffusion-weighted images) and \( D \times H \times W \) the volumetric spatial dimensions.
We randomly initialize a 3D UMamba model~\cite{ma2024u}, comprising an encoder \( f_\theta \) and a task-specific decoder \( f_\varphi \), and train it on \( D_u \) by optimizing a self-supervised objective \( \mathcal{L}_{\text{SSL}} \) determined by the chosen pretext task.
The aim is to encode meaningful anatomical and semantic features from the multi-modal MRI volumes by solving a proxy task.
Formally, the encoder \( f_{\theta}: X \mapsto z \) maps an unlabeled input volume \( X \sim D_u \) to its latent representation \( z \), and the decoder \( f_{\varphi} \) maps \( z \) to an output that is trained to approximate a predefined target \( T(X) \) derived from \( X \):
\[
\min_{\theta, \varphi} \; \mathbb{E}_{X \sim D_u} \left[ \mathcal{L}_{\text{SSL}}\left(f_\varphi(f_\theta(X)), T(X)\right) \right]
\]
Here, $T(X)$ is the task-specific target: for reconstruction-based objectives (Models Genesis, MAE), $T(X) = X$; for the pseudo-segmentation task (Volume Fusion), $T(X)$ is a voxel-wise label map derived from the fusion process.

\textbf{Volume Fusion}, as presented in \cite{wang2023mis}, is based on a pseudo-segmentation pretext task, where two sub-volumes are fused using different fusion categories. The model takes the fused volume as input and predicts the fusion category of each voxel. This pretraining strategy encourages the model to learn fine-grained spatial and semantic details. 

\textbf{Models Genesis} is a unified self-supervised framework proposed in~\cite{zhou2021models}, which corrupts 3D medical images using four types of augmentations. The model is then trained to reconstruct the original anatomical patterns from these distorted inputs. The representations learned in this way are intended to generalize across different organs, diseases, and imaging modalities.

\textbf{Masked Autoencoders} are a self-supervised method for learning representations by masking parts of the input image and reconstructing them in an autoencoder fashion~\cite{he2022masked}. Reconstructing the occluded regions encourages the model to learn meaningful image representations and anatomical context.
We train the 3D UMamba model~\cite{ma2024u} using an $L_2$ loss computed over the masked regions with a masking ratio of 75\%, following~\cite{he2022masked}.
Further methodological details and formulations for each pretraining strategy are provided in \hyperref[Appendix: pretraining]{Appendix~\ref*{Appendix: pretraining}}.
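
As a minimal sketch of the masked reconstruction objective described above, one common way to write such a loss is given below. The notation is introduced here for illustration only and is an assumption, not necessarily the formulation used in the appendix: $M \in \{0,1\}^{D \times H \times W}$ is a binary mask shared across channels with $M_v = 1$ for masked voxels $v$, $\tilde{X}$ is the masked input, and $\hat{X} = f_\varphi(f_\theta(\tilde{X}))$ is the reconstruction.
\[
\mathcal{L}_{\text{MAE}} = \frac{1}{C \sum_{v} M_v} \sum_{c=1}^{C} \sum_{v} M_v \left( \hat{X}_{c,v} - X_{c,v} \right)^2,
\qquad
\frac{1}{DHW} \sum_{v} M_v \approx 0.75 .
\]
Under this sketch, visible voxels ($M_v = 0$) do not contribute to the loss, so the model is penalized only for how well it infers the occluded 75\% of the volume from the remaining context.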

\subsection{Fine-Tuning} 
\label{section:fine_tuning}
The pretrained models are fine-tuned using the multi-task UMamba framework, following the setup in~\cite{larsen2025prostate}. For reconstruction-based pretraining (MG, MAE), we replace the final convolutional layers of the pretrained decoder with a segmentation head, in which each $\text{Conv3D}$ layer is reinitialized to output $C_{\text{out}} = 5$ channels for PCa and zone prediction. Volume Fusion--pretrained models already produce compatible multi-channel segmentation outputs and therefore require no adaptation. For fine-tuning, we apply the composite multi-task loss introduced in~\cite{larsen2025prostate}:
\[
\mathcal{L}_{\text{csPCa}} = \mathcal{L}_{\text{Focal}}, \quad
\mathcal{L}_{\text{Zonal}} = \lambda \mathcal{L}_{\text{Dice}} + (1 - \lambda)\mathcal{L}_{\text{CE}}, \quad
\mathcal{L}_{\text{Total}} = \mathcal{L}_{\text{csPCa}} + \beta \mathcal{L}_{\text{Zonal}},
\]
where \( \lambda = 0.5 \) balances the Dice and cross-entropy (CE) losses, and \( \beta = 0.2 \) sets the relative weight of the zonal segmentation task. This formulation is designed to accommodate both the highly imbalanced csPCa detection task and the anatomical zonal delineation objective.
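
Substituting the stated weights into the definitions above makes the relative contribution of each term explicit. This is plain arithmetic on \( \lambda = 0.5 \) and \( \beta = 0.2 \):
\[
\mathcal{L}_{\text{Total}}
= \mathcal{L}_{\text{Focal}} + 0.2 \left( 0.5\, \mathcal{L}_{\text{Dice}} + 0.5\, \mathcal{L}_{\text{CE}} \right)
= \mathcal{L}_{\text{Focal}} + 0.1\, \mathcal{L}_{\text{Dice}} + 0.1\, \mathcal{L}_{\text{CE}} .
\]
For reference, the Focal Loss of~\cite{lin2017focal} for a voxel whose true class is predicted with probability \( p_t \) is commonly written as
\[
\mathcal{L}_{\text{Focal}} = -\alpha_t \left( 1 - p_t \right)^{\gamma} \log p_t ,
\]
where \( \alpha_t \) is a class-balancing weight and \( \gamma \geq 0 \) down-weights well-classified voxels. The values of \( \alpha_t \) and \( \gamma \) are not restated in this section; see~\cite{larsen2025prostate} for the configuration used.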


\subsection{Comparison with other methods}
\label{sec: comp methods}
We compare our approach against three baselines: the transformer-based Swin-UNETR, a plain UMamba model, and Spark3D, a SOTA CNN-based SSL method derived from the nnU-Netv2 architecture~\cite{wald2025revisiting} and modified for MAE pretraining. For plain UMamba, the three pretrained models (VF, MG, MAE) were fine-tuned using Focal Loss~\cite{lin2017focal}.



The Spark3D and Swin-UNETR baselines were adapted to three-channel bpMRI input and pretrained using an MAE-based reconstruction objective. Spark3D uses a residual-encoder nnU-Netv2 backbone~\cite{isensee2024nnu}, while Swin-UNETR employs a hybrid CNN--Transformer architecture featuring hierarchical Swin Transformer encoding and convolutional decoding~\cite{hatamizadeh2021swin}. Both models were fine-tuned using Focal Loss~\cite{lin2017focal}, with weights for both the encoder and decoder transferred from the pretrained state. To ensure a fair comparison, model checkpoints for both baselines were selected using the same evaluation metric and methodology as for our proposed method.

\subsection{Evaluation Metric and Implementation details}
\textbf{Evaluation Metric}
Following the PI-CAI challenge guidelines~\cite{saha2024artificial}, we evaluated model performance using the PI-CAI score, defined as the mean of the Average Precision (AP), which reflects lesion-level detection performance, and the Area Under the Receiver Operating Characteristic Curve (AUROC), which reflects patient-level diagnosis. The metrics are computed from non-overlapping lesion detection maps extracted from the softmax output of the models~\cite{saha2024artificial, bosma2023semisupervised}.
The PI-CAI score is given by
\[
\text{score} = \frac{\text{AP} + \text{AUROC}}{2}.
\]
In addition, for the PI-CAI hidden test set (N=1,000), we report a clinically relevant operating point from Free-Response ROC (FROC) analysis: Sens3, the lesion-level sensitivity at a radiologist-equivalent false-positive rate (PI-RADS $\geq$ 3). This provides a clinically interpretable measure linked to the planning of targeted biopsies.
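
As an illustration of the PI-CAI score defined above, consider a hypothetical model with $\text{AP} = 0.60$ and $\text{AUROC} = 0.90$. These values are invented for this example and are not results from this study:
\[
\text{score} = \frac{0.60 + 0.90}{2} = 0.75 .
\]
Because the two terms are weighted equally, a gain of 0.02 in lesion-level AP raises the score by 0.01, exactly as a gain of 0.02 in patient-level AUROC would.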

\textbf{Implementation details}
The diffusion-weighted imaging (DWI) channels (ADC, HBV) were resampled to \(T_2\)w resolution before pretraining and fine-tuning. Patches of size \(256 \times 256 \times 20\) were extracted by cropping around the center of the prostate, with zero padding or reflect padding applied as needed. For pretraining and fine-tuning, we used the PyTorch~\cite{paszke2019pytorch}, MONAI~\cite{cardoso2022monai}, nnU-Netv2~\cite{isensee2024nnu}, and nnSSL~\cite{wald2025revisiting} frameworks.

For pretraining, the in-house dataset was split into training (N=2,331) and validation (N=100) subsets, corresponding to approximately 96\% and 4\% of the 2,431 scans. The model received three-channel inputs with augmentations tailored to each pretext task, following the protocols described in~\cite{wald2025revisiting}. Pretraining was performed on an NVIDIA A40 GPU for 700 epochs with z-score-normalized input, using SGD with a polynomial learning rate scheduler. Spark3D and Swin-UNETR were trained with an MAE objective, incorporating architectural and patch-size modifications, respectively, to address bpMRI anisotropy (see Appendices~\ref{Appendix: Spark3D} and~\ref{Appendix: Swin-UNETR} for details).

For fine-tuning, models were trained on the PI-CAI public training set (N=1,500) using five-fold cross-validation, with 80\% of the cases used for training and 20\% for validation in each fold. For inference, the softmax outputs of the five fold models were averaged (mean ensemble), and PCa detection maps were subsequently generated using the lesion candidate extraction method described in~\cite{saha2024artificial,bosma2023semisupervised}. The PCa detection maps were obtained using the same procedure for all comparative methods. All five-fold ensembles of fine-tuned models were wrapped in Docker containers for submission to the PI-CAI challenge forum.

Both UMamba and UMamba-MTL backbones were trained for 130 epochs, as models typically converged by 100 epochs, following~\cite{larsen2025prostate}. Weights from both the encoder and decoder were transferred, with a warm-up phase of up to 10 epochs during which the encoder was gradually unfrozen. Fine-tuning used the AdamW optimizer, a learning rate of \(1 \times 10^{-4}\), a batch size of 8, a cosine annealing scheduler, and random augmentations~\cite{larsen2025prostate}.
An ablation study of fine-tuning strategies is detailed in \hyperref[results: fine-tuning strategies]{Section~\ref*{results: fine-tuning strategies}}.

Similarly, the MAE-pretrained Swin-UNETR was fine-tuned following the same protocol as the UMamba models, while its randomly initialized counterpart was trained from scratch using the same training setup. For both models, the patch size was set to $256 \times 256 \times 32$ to satisfy architectural requirements.


Spark3D fine-tuning on the PI-CAI training set followed the training hyperparameters reported in~\cite{wald2025revisiting}, using a five-fold split and training for 1,000 epochs with a warm-up phase of 12.5k iterations ($\sim$50 nnU-Net epochs). Training employed SGD with momentum 0.99, a learning rate of \(1 \times 10^{-3}\), a batch size of 3, a polynomial scheduler, and a weight decay of \(3 \times 10^{-5}\). The Spark3D backbone (ResEnc-UNet/nnU-Netv2) without SSL pretraining was trained similarly but with a higher learning rate of \(1 \times 10^{-2}\). All fine-tuning was conducted on a single NVIDIA A100 GPU on the IDUN cluster~\cite{sjalander2019epic}.

