
\section{Discussion and conclusion}
In this study, we show that the UMamba-MTL backbone with MAE pretraining (UMamba-ProSSL) achieves state-of-the-art PCa detection on bpMRI. It demonstrates superior performance on the PI-CAI hidden tuning cohort (N=100), the large-scale, multi-center hidden testing cohort (N=1,000), and the OOD external P158 dataset (N=158). Among these, the PI-CAI leaderboard is the most comprehensive benchmark for PCa detection available to date.


On the PI-CAI hidden tuning set\footnote{\url{https://pi-cai.grand-challenge.org/evaluation/open-development-phase/leaderboard/}}, UMamba-ProSSL obtained a PI-CAI score of 0.818 (95\% CI: 0.730–0.898), an AUC of 0.914 (95\% CI: 0.852–0.963), and an AP of 0.722 (95\% CI: 0.592–0.846), achieving a 3.02\% relative gain in the PI-CAI score over Models Genesis \cite{zhou2021models} and a 0.5\% gain over the second-ranked method on the leaderboard. At the time of evaluation, these results placed UMamba-ProSSL 1$^{\text{st}}$ among 775 entries, highlighting its competitiveness.
The synergy between MAE pretraining and the UMamba backbone leads to tangible performance improvements. Notably, MAE pretraining also boosts the plain UMamba model's performance, resulting in a $2.50\%$ relative gain in AUC and a substantial $8.56\%$ relative gain in AP compared to training from scratch. While MAE pretraining improves the performance of the Spark3D model \cite{wald2025revisiting} and Swin-UNETR \cite{hatamizadeh2021swin} relative to their respective baselines, a gap in PI-CAI score persists compared with our proposed model. This disparity may be due to the anisotropic nature of prostate imaging and thus further supports our choice of a strong backbone architecture for robust performance. We also observed that volume fusion \cite{wang2023mis}, a segmentation-based pretraining approach that optimizes the Dice metric, is not optimal for the downstream PCa detection task, especially considering the ineffectiveness of the Dice score for evaluating multifocal PCa lesions \cite{yan2022impact}.
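
For reference, the tuning-set figures quoted above can be cross-checked by hand. The expressions below are a sketch that assumes the standard PI-CAI ranking definition, in which the overall score is the arithmetic mean of the patient-level AUC and the lesion-level AP, and that assumes relative gains are measured against the comparator's value; the symbols $m_{\text{ours}}$ and $m_{\text{ref}}$ are introduced here for illustration only.
\[
  \text{PI-CAI score} = \frac{\text{AUC} + \text{AP}}{2} = \frac{0.914 + 0.722}{2} = 0.818,
  \qquad
  \text{relative gain} = \frac{m_{\text{ours}} - m_{\text{ref}}}{m_{\text{ref}}} \times 100\%.
\]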

UMamba-ProSSL was further assessed on the PI-CAI hidden test set\footnote{\url{https://pi-cai.grand-challenge.org/evaluation/challenge/leaderboard/}} (N=1,000), attaining a PI-CAI score of 0.780 (95\% CI: 0.747–0.813), an AUC of 0.905 (95\% CI: 0.885–0.924), and an AP of 0.655 (95\% CI: 0.603–0.706). Moreover, at a clinically relevant operating point corresponding to a radiologist-equivalent false-positive rate per examination (PI-RADS $\geq 3$), the model achieved a lesion-level sensitivity (Sens3) of 0.761, the highest among all leaderboard entries, compared with a sensitivity of 0.961 reported for human readers on the PI-CAI hidden test set. These results positioned UMamba-ProSSL highest among the 45 submissions on the leaderboard, surpassing all previous CNN, nnDetection, and transformer-based methods. Critically, this result was achieved on a rigorously designed multi-center cohort from eight sites using Siemens and Philips MRI scanners, featuring significant heterogeneity in patient demographics, disease prevalence, and imaging acquisition protocols. Securing first place under these real-world retrospective conditions demonstrates UMamba-ProSSL's robustness, strong clinical generalizability, and sensitivity in this European setting.

On the PI-CAI hidden test set, the baseline UMamba-MTL model also performed commendably, achieving a PI-CAI score of 0.776 (95\% CI: 0.704–0.807), an AUC of 0.896 (95\% CI: 0.875–0.916), and an AP of 0.656 (95\% CI: 0.606–0.704). These results further indicate the capacity of the UMamba architecture to capture long-range dependencies, supported by the integration of zones in the auxiliary task. Although the absolute performance difference between UMamba-ProSSL and the UMamba-MTL model trained from scratch is modest on the hidden test set, the overall gain from pretraining is consistent across multiple complementary metrics and evaluation settings. Notably, UMamba-ProSSL achieves higher sensitivity at the clinically relevant operating point (Sens3) and a performance gain of more than 3\% over its non-pretrained counterpart on the OOD external P158 dataset, further highlighting the model's clinical relevance, generalizability, and robustness.
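
To make the size of this difference concrete, the hidden-test point estimates reported above give the following absolute differences between UMamba-ProSSL and the UMamba-MTL baseline. This is plain subtraction of the point estimates and does not account for the confidence intervals.
\[
  \Delta\text{Score} = 0.780 - 0.776 = 0.004,
  \qquad
  \Delta\text{AUC} = 0.905 - 0.896 = 0.009,
  \qquad
  \Delta\text{AP} = 0.655 - 0.656 = -0.001.
\]
On this cohort the gain in PI-CAI score is therefore carried by the AUC, while AP is essentially unchanged.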

The qualitative results presented in \autoref{fig:qualitative_result} show that csPCa detection is a challenging task. The false negative cases are often attributed to isointensities in the ADC and DWI modalities, which obscure true cancerous lesions. Conversely, false positives might reflect prostatitis or other benign findings that exhibit hypo- and hyperintensities in the ADC and high b-value (HBV) DWI modalities, respectively, leading the model to incorrectly identify them as true lesions. For true positive lesions, our method (UMamba-ProSSL) achieved a high average precision (AP), demonstrating high lesion-level precision across both the PI-CAI hidden tuning and hidden testing sets. The high precision is crucial for guiding biopsy procedures, as the patient-level AUC does not reflect lesion localization. High precision directly improves the urologist's ability to sample correctly during biopsy, thereby enhancing diagnostic accuracy and reducing unnecessary procedures.


The effectiveness of the large-scale pretraining approach, particularly its observed label efficiency in the low-data regime, directly addresses clinical settings where vast archives of routine, unlabeled bpMRI images are available. Our SSL method using MAE leverages this data to learn generalizable representations before fine-tuning, thereby improving csPCa detection while reducing reliance on scarce and expensive expert annotations.

Several limitations of our study guide future investigations. First, our model was trained on a mix of human and AI annotations (220 human and 205 AI). Although the nnU-Netv2 model by \cite{pooch2025semi} was trained with substantially more human labels (425) and still achieved lower performance on the same benchmark, future work should include an analysis using fully human-labeled data to further isolate the effect of large-scale pretraining and better characterize performance in low-data regimes.
Second, while our method cannot be directly compared to the reader study of \cite{saha2024artificial} due to differences in cohort size and study design, valuable contextual insights can still be drawn. That study reported an AUC of 0.86 (95\% CI: 0.83–0.89) on 400 cases, whereas our approach achieved an AUC of 0.905 (95\% CI: 0.885–0.924) on 1,000 cases from the PI-CAI hidden test set. Although not directly comparable, this suggests that our model is competitive with expert readers in patient-level diagnosis. Since our evaluation on the PI-CAI dataset is retrospective, prospective clinical studies are essential to fully assess the utility of our method.

Although we evaluated on the external Prostate158 (P158) dataset to assess OOD generalization, P158 does not follow the same reference standard as PI-CAI, particularly with respect to the explicit distinction between clinically significant and clinically insignificant PCa. To our knowledge, no other publicly available dataset currently matches the scale and standardized labeling of PI-CAI. Consequently, the lower absolute PI-CAI scores observed on P158 should be interpreted as a dataset-related limitation.


Finally, while MAE outperformed other SSL pretext tasks, future research should explore prostate-specific tasks that leverage anatomy and clinical data, and should investigate how performance scales with unlabeled data volume to identify potential diminishing returns. Furthermore, while we compared against a hybrid CNN–Transformer architecture (Swin-UNETR) pretrained with an MAE objective, future work should explore pure transformer-based models and a broader range of self-supervised objectives, as the performance of Swin-UNETR was relatively limited in the context of prostate MRI.

In conclusion, UMamba-ProSSL achieves state-of-the-art performance on a large-scale benchmark by combining a UMamba-MTL backbone with MAE-based large-scale self-supervised pretraining, thereby advancing prostate cancer detection.

