Self-supervised Vision Transformers for Prostate Cancer Classification in Biparametric MRI

Shebna Rose Fabilloren; Jose Conrado T. Paulino; Johanna Patricia A. Cañal; Prospero C. Naval Jr.

Published: 21 Jul 2025, Last Modified: 20 Aug 2025. Venue: MSB EMERGE 2025 (Conditional, requires major revision). License: CC BY 4.0

Keywords: prostate cancer, vision transformer, biparametric mri

Abstract: Multiparametric and biparametric magnetic resonance imaging (mpMRI/bpMRI) play an essential role in the detection, pre-biopsy planning, and staging of clinically significant prostate cancer (csPCa). One of the most commonly used structured reporting schemes for evaluating prostate MRI in suspected prostate cancer is the Prostate Imaging–Reporting and Data System (PI-RADS) v2.1, developed by multiple international representative groups. Existing machine learning models for classifying csPCa using PI-RADS are not reproducible due to the limited availability of their datasets. Meanwhile, public datasets lack PI-RADS labels, a standard in prostate MRI reporting. This hinders progress in the research community. FastMRI Prostate is a recently released, publicly available slice-level MRI dataset with PI-RADS labels. Because the dataset is new, research using it remains limited, and no studies have yet applied DINOv2 to csPCa classification on bpMRI, even though several medical imaging studies have shown DINOv2 to be an effective feature extractor. This study aims to address these gaps by assessing the advantages and limitations of the DINOv2 family of foundation models on the FastMRI Prostate dataset for binary csPCa classification. Our findings reveal that DINOv2 models outperformed ImageNet-pretrained CNN-based models. The ViT-g variant obtained an AUROC of 0.889 for the T2W model and 0.862 for the DWI model. This suggests that DINOv2 feature representations are adaptable to this downstream task. The performance difference between ViT-g and ViT-L was minimal, but training time and VRAM requirements differed two-fold, making ViT-L a good alternative when computational resources are limited. ViT-S (21M parameters) achieved performance comparable to ResNet-152 (60M parameters). Overall, this suggests that DINOv2 models offer a good trade-off between performance and computational cost, making them a viable option even in resource-constrained environments.
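
For readers unfamiliar with the metric reported in the abstract, the following is a minimal sketch of how an AUROC can be computed for a binary classifier from per-slice labels and scores, using the pairwise (Mann-Whitney) formulation. It is not the authors' evaluation code: the function name `auroc`, the label convention (1 = csPCa-positive), and the toy labels and scores are all assumptions made for illustration.

```python
def auroc(labels, scores):
    """Area under the ROC curve via the pairwise (Mann-Whitney) formulation.

    labels: iterable of 0/1 (assumed convention: 1 = csPCa-positive slice)
    scores: iterable of classifier scores, higher = more likely positive
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    # Count positive/negative pairs that are ranked correctly.
    # A tie between a positive and a negative counts as half a win.
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy, made-up labels and scores; NOT data from the study.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(auroc(toy_labels, toy_scores))  # 3 of 4 pairs ranked correctly -> 0.75
```

An AUROC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which gives some context for the 0.889 (T2W) and 0.862 (DWI) values quoted above. The quadratic pairwise loop is chosen for readability; a rank-based implementation would be preferable for large test sets.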

Camera Ready Submission: zip

Submission Number: 8