
\section{Results}
\subsection{Quantitative and Qualitative results}


We pretrained all backbones (nnU-Netv2, Swin-UNETR, UMamba, and UMamba-MTL) with multiple self-supervised learning (SSL) objectives and then fine-tuned them, as detailed in \hyperref[sec:pretraining]{section~\ref*{sec:pretraining}} and \hyperref[section:fine_tuning]{section~\ref*{section:fine_tuning}}.
Based on the five-fold cross-validated results on the PI-CAI hidden tuning cohort (N=100), UMamba-MTL with MAE pretraining (\textbf{UMamba-ProSSL}) achieved the highest PI-CAI score, as shown in \autoref{tab:picai_test_100_95ci}. This model was subsequently evaluated on the PI-CAI hidden testing cohort (N=1,000). For comparison, a non-SSL UMamba-MTL model was also tested, and the results are reported in \autoref{tab:picai_test_1000_ci95}. Additionally, all SSL and non-SSL models across all backbone architectures were evaluated on the external OOD Prostate158 (P158) dataset, and the corresponding results are presented in \autoref{tab:p158_results}.
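
For reading the tables, it may help to recall that the PI-CAI score is the challenge ranking metric, taken as the mean of the case-level AUC and the lesion-level AP. As a worked check using the UMamba-ProSSL entries of \autoref{tab:picai_test_100_95ci}:
\[
\mbox{PI-CAI score} = \frac{\mathrm{AUC} + \mathrm{AP}}{2} = \frac{0.914 + 0.722}{2} = 0.818 .
\]
The same relation holds, up to rounding, for the other rows of the tables in this section.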

% \FloatBarrier
\begin{table}[t]
\centering
\caption{Performance of nnU-Netv2, Swin-UNETR, UMamba, and UMamba-MTL with different SSL pretraining strategies on the PI-CAI hidden tuning cohort (N=100). PI-CAI score, AUC, and AP are reported with 95\% confidence intervals, along with the corresponding leaderboard ranks. $\dagger$ denotes the best-performing model, referred to as \textbf{UMamba-ProSSL}.}
\label{tab:picai_test_100_95ci}
\begin{tabular}{llcccc}
\hline
\textbf{Model} & 
\makecell{\textbf{SSL}\\\textbf{Technique}} & 
\makecell{\textbf{PI-CAI}\\\textbf{Score}} & 
\textbf{AUC} & 
\textbf{AP} & 
\textbf{Rank} \\
\hline

\multirow{2}{*}{nnU-Netv2}
  & Scratch &
    \makecell{0.710 \\ (0.603–0.818)} &
    \makecell{0.820 \\ (0.725–0.903)} &
    \makecell{0.601 \\ (0.452–0.756)} &
    221$^{\text{st}}$ \\  
 & \textbf{Spark3D} &
     \makecell{\textbf{0.736} \\ (0.631–0.834)} &
     \makecell{\textbf{0.841} \\ (0.754–0.913)} &
     \makecell{\textbf{0.631} \\ (0.484–0.771)} &
     \textbf{155$^{\text{th}}$} \\
\hline
\multirow{2}{*}{Swin-UNETR}
  & Scratch &
    \makecell{0.665 \\ (0.556–0.772)} &
    \makecell{0.792 \\ (0.692–0.883)} &
    \makecell{0.537 \\ (0.393–0.682)} &
    269$^{\text{th}}$ \\
 & \textbf{MAE} &
     \makecell{\textbf{0.699} \\ (0.593–0.798)} &
     \makecell{\textbf{0.805} \\ (0.714–0.855)} &
     \makecell{\textbf{0.594} \\ (0.453–0.729)} &
     \textbf{235$^{\text{th}}$} \\

\hline
\multirow{4}{*}{UMamba}
  & Scratch &
    \makecell{0.735 \\ (0.631–0.826)} &
    \makecell{0.843 \\ (0.760–0.914)} &
    \makecell{0.627 \\ (0.483–0.751)} &
    156$^{\text{th}}$ \\
  & Volume Fusion &
    \makecell{0.716 \\ (0.611–0.818)} &
    \makecell{0.827 \\ (0.739–0.902)} &
    \makecell{0.605 \\ (0.457–0.756)} &
    205$^{\text{th}}$ \\
  & Model Genesis &
    \makecell{0.738 \\ (0.634–0.835)} &
    \makecell{0.835 \\ (0.749–0.911)} &
    \makecell{0.641 \\ (0.499–0.733)} &
    148$^{\text{th}}$ \\
  & \textbf{MAE} &
 \makecell{\textbf{0.773} \\ (0.671–0.866)} &
 \makecell{\textbf{0.862} \\ (0.777–0.933)} &
 \makecell{\textbf{0.685} \\ (0.546–0.813)} &
 \textbf{42$^{\text{nd}}$} \\


\hline
\multirow{4}{*}{\makecell{\textbf{UMamba} \\ \textbf{-MTL (ours)}}}
  & Scratch &
    \makecell{0.781 \\ (0.689–0.865)} &
    \makecell{0.867 \\ (0.791–0.931)} &
    \makecell{0.696 \\ (0.564–0.813)} &
    34$^{\text{th}}$ \\
  & Volume Fusion &
    \makecell{0.750 \\ (0.652–0.839)} &
    \makecell{0.868 \\ (0.794–0.931)} &
    \makecell{0.631 \\ (0.493–0.760)} &
    93$^{\text{rd}}$ \\
  & Model Genesis &
    \makecell{0.794 \\ (0.703–0.875)} &
    \makecell{0.888 \\ (0.818–0.944)} &
    \makecell{0.701 \\ (0.573–0.816)} &
    22$^{\text{nd}}$ \\
  & \textbf{MAE$\dagger$} &
    \makecell{\textbf{0.818} \\ (0.730–0.898)} &
    \makecell{\textbf{0.914} \\ (0.852–0.963)} &
    \makecell{\textbf{0.722} \\ (0.592–0.846)} &
    \textbf{1$^{\text{st}}$} \\
\hline

\end{tabular}
\end{table}

% \FloatBarrier
\begin{table}[t]
    \centering
    \caption{Results for UMamba-ProSSL and UMamba-MTL on the PI-CAI hidden testing cohort (N=1,000). PI-CAI score, AUC, and AP are reported with 95\% confidence intervals, along with the corresponding leaderboard ranks. Sens3 denotes the lesion-level sensitivity at the radiologist-equivalent operating point (PI-RADS $\geq$ 3) derived from the FROC analysis.}

    \label{tab:picai_test_1000_ci95}
    \resizebox{\columnwidth}{!}{%
    \begin{tabular}{lcccccc}
        \hline
        \textbf{Model} &
        \makecell{\textbf{SSL}\\\textbf{Technique}} &
        \makecell{\textbf{PI-CAI}\\\textbf{Score}} &
        \textbf{AUC} &
        \textbf{AP} &
        \makecell{\textbf{Sens3}} &
        \textbf{Rank} \\
        \hline

        UMamba-ProSSL & \textbf{MAE} &
        \makecell{\textbf{0.780} \\ (0.747--0.813)} &
        \makecell{\textbf{0.905} \\ (0.885--0.924)} &
        \makecell{\textbf{0.655} \\ (0.603--0.706)} &
        \textbf{0.761} &
        \textbf{1$^{\text{st}}$} \\

        UMamba-MTL & Scratch &
        \makecell{0.776 \\ (0.704--0.807)} &
        \makecell{0.896 \\ (0.875--0.916)} &
        \makecell{0.656 \\ (0.606--0.704)} &
        0.736 &
        2$^{\text{nd}}$ \\

        \hline
    \end{tabular}
    }
\end{table}


% \FloatBarrier

\begin{table}[t]
\centering
\caption{PI-CAI score, AUC, and AP of nnU-Netv2, Swin-UNETR, UMamba, and UMamba-MTL with different SSL pretraining strategies on the external out-of-distribution P158 dataset.}
\label{tab:p158_results}
\begin{tabular}{llccc}
\hline
\textbf{Model} &
\makecell{\textbf{SSL}\\\textbf{Technique}} &
\makecell{\textbf{PI-CAI}\\\textbf{Score}} &
\textbf{AUC} &
\textbf{AP} \\
\hline
\multirow{2}{*}{nnU-Netv2}
 & Scratch &
 0.653 &
 0.831 &
 0.476 \\
 & \textbf{Spark3D} &
 \textbf{0.701} &
 \textbf{0.850} &
 \textbf{0.552} \\
\hline
\multirow{2}{*}{Swin-UNETR}
 & Scratch &
 0.633 &
 0.803 &
 0.463 \\
 & \textbf{MAE} &
 \textbf{0.639} &
 \textbf{0.782} &
 \textbf{0.496} \\
\hline
\multirow{4}{*}{UMamba}
 & Scratch &
 0.705 &
 0.851 &
 0.559 \\

 & Volume Fusion &
 0.675 &
 0.793 &
 0.557 \\
 & Model Genesis &
 0.704 &
 0.853 &
 0.556 \\
 & \textbf{MAE} &
 \textbf{0.715} &
 \textbf{0.832} &
 \textbf{0.597} \\
\hline
\multirow{4}{*}{\textbf{UMamba-MTL}}
 & Scratch &
 0.716 &
 0.828 &
 0.603 \\
 & Volume Fusion &
 0.697 &
 0.829 &
 0.566 \\
 & Model Genesis &
 0.721 &
 0.836 &
 0.607 \\
 & \textbf{MAE} &
 \textbf{0.746} &
 \textbf{0.845} &
 \textbf{0.647} \\
\hline
\end{tabular}
\end{table}


Overall, every SSL strategy except Volume Fusion improved the downstream PI-CAI score relative to training from scratch, with MAE providing the most consistent and largest gains among the pretext tasks. In terms of backbone performance, UMamba-MTL consistently outperformed UMamba, nnU-Netv2, and Swin-UNETR.
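
To make the size of these effects concrete, the gain of an SSL strategy over training from scratch can be read off \autoref{tab:picai_test_100_95ci} as $\Delta = S_{\mathrm{SSL}} - S_{\mathrm{scratch}}$, where $S$ is the PI-CAI score on the hidden tuning cohort (the symbols $\Delta$ and $S$ are introduced here only for illustration). For the UMamba-MTL backbone this gives
\[
\begin{array}{lrcl}
\mbox{MAE:} & \Delta & = & 0.818 - 0.781 = +0.037, \\
\mbox{Model Genesis:} & \Delta & = & 0.794 - 0.781 = +0.013, \\
\mbox{Volume Fusion:} & \Delta & = & 0.750 - 0.781 = -0.031.
\end{array}
\]
These differences are point estimates; the 95\% confidence intervals in the table overlap, so they are best read as trends rather than as statistically established effects.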

To qualitatively illustrate predictions, \autoref{fig:qualitative_result} shows csPCa detection maps generated by UMamba-ProSSL on in-house T$_2$w, ADC, and HBV images, along with the corresponding prostate and zonal mask predictions.

The results for the auxiliary prostate and zonal mask predictions on the in-house dataset (N = 200) are reported in terms of the Dice similarity coefficient (DSC) in \hyperref[Appendix: Zonal Segmentation]{Appendix~\ref*{Appendix: Zonal Segmentation}}.


\begin{figure}[h]
\centering
% \hspace*{0.7cm}
\includegraphics[width=1\textwidth]{figures/prediction_examples_tp_fp_fn.pdf}
\caption{Qualitative results for UMamba-ProSSL on the in-house dataset (N=200). Ground truth (GT) annotations for clinically significant prostate cancer (csPCa) and the prostate zones, i.e., peripheral zone (PZ) and transition zone (TZ), are overlaid on T$_2$w images. Model predictions for csPCa and zonal anatomy are overlaid on T$_2$w, ADC, and HBV modalities. The orange contours on the ADC and HBV images illustrate the lesion hypointensities and hyperintensities, respectively. The first row shows a true positive (TP) case, the second row shows a false negative (FN) case, and the last row shows a false positive (FP) case.}
 \label{fig:qualitative_result}
\end{figure}


\subsection{Ablation studies}
We conducted two key ablation studies using the PI-CAI public training set, evaluating performance using the mean results across five cross-validation folds. First, we assessed the benefits of large-scale pretraining by artificially reducing the proportion of labeled data. Second, we investigated fine-tuning strategies to determine the optimal approach for transferring learned weights to the downstream task, evaluating different combinations of encoder freezing, decoder initialization, and learning rate selection.


\subsubsection{Effect of large-scale pretraining}
\label{result: low data regime}
In many medical applications, and particularly in csPCa detection, labeled data are scarce. Even within the PI-CAI public training and development set, human annotations are available for only 220 cases. SSL-based pretraining can ease the annotation burden on radiologists and achieve higher performance than models trained from scratch on the same labeled data. To measure the effect of large-scale pretraining, we simulated low-data regimes by subsampling the PI-CAI training and development set to 10\%, 30\%, 50\%, and 70\% of its cases. The subsets were drawn by stratifying the data on ISUP grade and human annotations. Specifically, cases were first grouped by ISUP grade, and for clinically significant cases (ISUP $>$ 1), an additional stratification was applied based on whether expert lesion annotations were available. Stratified sampling was then performed so that each reduced subset preserved the original distribution of disease severity and the proportion of human-annotated cases. The outcomes are presented in \autoref{fig:percentage}.
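
The subsampling described above can be summarized schematically as follows; the notation is assumed here for illustration and does not describe the exact implementation. Let the cases be partitioned into strata $s$ (ISUP grade, further split by annotation availability for ISUP $>$ 1), with $n_s$ cases in stratum $s$ and $N = \sum_s n_s$. For a retained fraction $p$, each stratum contributes approximately
\[
n_s^{(p)} \approx p \, n_s, \qquad \mbox{so that} \qquad \frac{n_s^{(p)}}{\sum_{s'} n_{s'}^{(p)}} \approx \frac{n_s}{N},
\]
where the rounding rule is left unspecified. The subset sizes reported for this experiment (120, 600, and 840 cases) are consistent with $N = 1200$ at $p = 0.1$, $0.5$, and $0.7$; under the same assumption, $p = 0.3$ would correspond to $0.3 \times 1200 = 360$ cases.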

With only 120 images (10\% of the data), the pretrained model achieved a substantially higher score than its counterpart trained from scratch. Notably, the pretrained model attained a higher score with just 50\% of the data ($N=600$) than the scratch model did with 70\% of the data ($N=840$). With the full dataset, the scratch model closed the gap, as the larger labeled set supplied enough supervision to compensate for random initialization. Although the final scores are very close, the pretrained model reaches this performance with fewer labeled cases and generalizes more robustly, as substantiated by the findings in \autoref{tab:picai_test_100_95ci} and \autoref{tab:picai_test_1000_ci95}.
\label{Appendix: ablation_low_data}


\begin{figure}[htbp]
    \centering
    \includegraphics[width=0.6\textwidth]{figures/low_data_v1.pdf}
    \caption{Comparison of PI-CAI scores between UMamba-ProSSL and UMamba-MTL trained from scratch across varying labeled set sizes. Each point represents the mean over five cross-validation folds.}
    \label{fig:percentage}
\end{figure}



\subsubsection{Fine-tuning strategies}
\label{results: fine-tuning strategies}

We investigated different fine-tuning strategies on our best-performing model, UMamba-ProSSL, to determine the optimal approach for transferring learned weights to the downstream task. We varied decoder initialization and encoder freezing during warmup, resulting in four ablation settings. The number of warmup epochs was set to 10, obtained by translating the warmup iterations from~\cite{wald2025revisiting} into relative epochs for our setup. We also explored different maximum learning rates. The results are summarized in \autoref{tab:fine_tuning_strategies}.

The results show a marked performance drop when the decoder is randomly initialized and the encoder is frozen during the warmup period (0.639 vs.\ 0.710 without freezing). Transferring both encoder and decoder weights while freezing the encoder during warmup also reduces performance (0.682 vs.\ 0.716 without freezing). Lastly, a smaller maximum learning rate does not improve performance: the scores at $10^{-3}$, $10^{-4}$, and $10^{-5}$ are 0.715, 0.716, and 0.712, respectively, and thus differ by at most 0.004.
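
The ``encoder freeze during warmup'' setting can be written schematically as a per-module learning rate. This is an illustrative formalization under assumed notation (including the epoch indexing), not the exact schedule used. With $E_{\mathrm{w}} = 10$ warmup epochs and $\eta(e)$ the scheduled learning rate at epoch $e$, bounded by the maximum learning rate listed in \autoref{tab:fine_tuning_strategies},
\[
\eta_{\mathrm{enc}}(e) = \left\{ \begin{array}{ll} 0, & e \le E_{\mathrm{w}}, \\ \eta(e), & e > E_{\mathrm{w}}, \end{array} \right. \qquad \eta_{\mathrm{dec}}(e) = \eta(e),
\]
so that only the decoder is updated during warmup. In the settings without freezing, $\eta_{\mathrm{enc}}(e) = \eta(e)$ for all epochs.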

\begin{table}[t]
\centering
\caption{Comparison of fine-tuning strategies for UMamba-ProSSL on the PI-CAI public training and development set. Results assess the impact of encoder/decoder pretraining and encoder freezing, alongside variations in the maximum learning rate (Max.\ LR). Performance is reported as the mean PI-CAI score across five cross-validation folds.}
\begin{tabular}{@{} c c c c c @{}}
\toprule
\textbf{Encoder Pretrain} & \textbf{\makecell{Encoder Freeze \\ during warmup}} & \textbf{Decoder Pretrain} & \textbf{Max. LR} & \textbf{PI-CAI Score} \\
\midrule
Yes & No  & No  & $10^{-4}$ & 0.710 \\
\midrule
Yes & Yes & No  & $10^{-4}$ & 0.639 \\
\midrule
Yes & Yes & Yes & $10^{-4}$ & 0.682 \\
\midrule
Yes & No  & Yes & $10^{-3}$ & 0.715 \\
\textbf{Yes} & \textbf{No} & \textbf{Yes} & $\mathbf{10^{-4}}$ & \textbf{0.716} \\
Yes & No  & Yes & $10^{-5}$ & 0.712 \\
\bottomrule
\end{tabular}
\label{tab:fine_tuning_strategies}
\end{table}

