
\section{Methodology}

\subsection{Datasets}

The datasets used in this study include the training cohort (N=1500), which incorporates the ProstateX dataset \cite{litjens_spie-aapm_2017}, and the hidden tuning cohort (N=100) of the PI-CAI dataset \cite{saha_artificial_2022}. In the PI-CAI training cohort, 425 cases are histologically confirmed to have clinically significant PCa (csPCa), defined as grade group $\geq 2$. Of these csPCa cases, 220 include human expert annotations, while the annotations for the remaining csPCa cases are derived using the approach outlined in \cite{bosma_semisupervised_2023}.
Transition zone (TZ) and peripheral zone (PZ) masks for the training subset were AI-generated using a standard nnUNet \cite{isensee_nnu-net_2021}, trained on the ProstateX subset of the PI-CAI training data \cite{yuan_z-ssmnet_2022}.

The study further incorporated an in-house dataset (N=200) from NTNU/St. Olavs Hospital, Trondheim, Norway \cite{kruger-stokke_multiparametric_2021}, along with the Prostate158 dataset (N=158) \cite{adams_prostate158_2022}. Both datasets provide expert annotations for PCa and zonal anatomy. PCa annotations for the in-house cohort are defined in the same manner as those in the PI-CAI datasets, whereas the Prostate158 annotations also include grade group 1; Prostate158 is therefore excluded from the csPCa detection assessment in this study. A resident radiologist with at least two years of experience at St. Olavs Hospital, Trondheim, delineated the in-house dataset using ITK-SNAP software in collaboration with a senior radiologist with over ten years of experience in prostate MRI. Annotations encompassed all MRI-visible lesions classified by PI-RADS, histopathologically confirmed lesions from biopsy or radical prostatectomy, and zonal anatomy.

An overview of all datasets used in this study is provided in \tableref{tab:dataset}, and the clinical variables for each cohort are listed in \appendixref{appendix:clinical-variables}. All datasets in this study include T2-weighted (T2W), apparent diffusion coefficient (ADC), and high b-value (HBV) diffusion-weighted images, collectively referred to as bpMRI.


\begin{filecontents*}{pca_data_sources.csv}
Dataset;Cases;Type; Annotations
In-House$^1$;200;3T mpMRI; PCa, Zonal
PI-CAI Training cohort;1500;1.5T, 3T bpMRI; PCa$^2$, Zonal$^3$
PI-CAI Hidden tuning cohort;100;1.5T, 3T bpMRI; PCa
Prostate158; 158; 3T bpMRI; PCa, Zonal
\end{filecontents*}

\begin{table}[ht]
    \centering
    \csvautobooktabular[separator=semicolon]{pca_data_sources.csv}
    \caption{Prostate cancer dataset information. $^1$ denotes that the dataset is contained within the PI-CAI hidden test set cohort, $^2$ denotes that a subset of the labels is AI-generated (N=205, i.e., the 425 csPCa cases minus the 220 with human expert annotations), and $^3$ denotes that all masks are AI-generated.}
    \label{tab:dataset}
\end{table}



\subsection{Network Architecture}


We implemented the U-Mamba architecture \cite{ma_u-mamba_2024} to investigate the hypothesis that the enhanced ability of Mamba \cite{gu_mamba_2023} to model long-range dependencies is beneficial for PCa detection. As the performance of the Enc and Bot variants is reported to be similar, we opted for the Bot variant due to its reduced computational complexity \cite{10.1007/978-3-031-72114-4_47}.

% In contrast to the original code which uses the nnUNet framework \cite{isensee_nnu-net_2021}, we reimplemented the architecture using a PyTorch Lightning \cite{falcon_pytorch_2024} and MONAI \cite{consortium_monai_2024} based setup for enhanced flexibility.

The configuration of the U-Mamba architecture used in this paper consists of 7 convolution stages in both the encoder and the decoder, where each encoder stage consists of two $3\times3$ convolutions. The decoder consists of upsampling blocks in addition to a residual block. The bottleneck consists of the Mamba-based block, called the U-Mamba block, in addition to a residual block.
% The full U-Mamba architecture overview can be seen in \figureref{fig:U-Mamba-arch}.

% \begin{figure}[htbp]
% \includegraphics[width=\textwidth]{figures/u-mamba-combined.png}
% \caption{General overview of the U-Mamba (Bot) architecture} \label{fig:U-Mamba-arch}
% \end{figure}


\subsection{U-Mamba MTL}

% MTL Review in MIC:
% \cite{zhao_multi-task_2023}

To investigate whether incorporating zonal masks (TZ and PZ) improves PCa prediction, we explored two multitask learning (MTL) strategies using U-Mamba as the base architecture. Since zonal and PCa masks are not mutually exclusive, they can be treated as separate tasks. When tasks are highly related, a shared-parameter strategy is typically more effective. Conversely, if they are less related, allocating more task-specific parameters may be beneficial. Determining the optimal balance, however, requires experimentation.

% To explore the hypothesis that the inclusion of zonal masks (TZ and PZ) is beneficial for PCa prediction, we investigated two multitask learning (MTL) strategies using U-Mamba as the base architecture. As zonal and PCa masks are not mutually exclusive, they can be considered separate tasks. In cases where the different tasks are highly related, we would expect a strategy where most of the network architecture uses shared parameters to be the most effective. On the other hand, if the tasks are related to a lesser degree, one would assume that a strategy were a larger portion of the network is not shared to be more effective. However, the optimal split requires experimentation.

We define the two tasks as $T_0 = \text{PCa}$ and $T_1 = \text{zonal masks}$, where the zonal masks comprise the peripheral zone (PZ) and the transition zone (TZ). Our U-Mamba MTL architectures can then be formulated as:

\begin{equation}  
\begin{aligned}  
    \mathbf{z} &= f_{\text{enc}}(\mathbf{x}; \theta_{\text{enc}}),\\  
    \mathbf{y}_i &= f_{\text{dec}_i}(\mathbf{z}; \theta_{\text{dec}_i}), \quad \forall i \in \{1, \dots, N\}  
\end{aligned}  
\end{equation}  

% The first MTL strategy (U-Mamba MTL-Dual) is defined with $N=2$ such that two decoder branches predict $y_{T_0}$ and $y_{T_1}$. The encoder is then shared between the two tasks such that a shared representation can be learned in the shared parameters $\theta_{\text{enc}}$, and task specific representation is represented in the corresponding decoder parameters $\theta_{\text{dec}_i}$. The second strategy (U-Mamba MTL-Single) is defined with $N=1$ such that both $y_{T_0}$ and $y_{T_1}$ is predicted by a single decoder where all parameters are shared for both tasks.

% Except for the additional decoder branch in the U-Mamba MTL-Dual, the networks follows the same exact structure as the U-Mamba network described above, and the full architecture overview of our MTL architectures can be seen in \figureref{fig:U-MambaMTL-arch}.

The first strategy, U-Mamba MTL-Dual, uses $N=2$, meaning two decoder branches separately predict $\mathbf{y}_{T_0}$ (PCa) and $\mathbf{y}_{T_1}$ (zonal masks). The encoder is shared, learning a common representation, while each decoder branch captures task-specific features.

The second strategy, U-Mamba MTL-Single, uses $N=1$, meaning a single decoder predicts both $\mathbf{y}_{T_0}$ and $\mathbf{y}_{T_1}$, sharing all parameters across tasks.
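
One way to write the two variants out explicitly in the notation of the formulation above is sketched below. The mapping of the decoder indices $1$ and $2$ to the tasks $T_0$ and $T_1$, and the concatenation operator $[\cdot,\cdot]$ along the output channel dimension, are assumptions made here for illustration; skip connections between encoder and decoder are omitted for brevity.
\begin{equation*}
\begin{aligned}
    \text{MTL-Dual } (N=2): \quad & \mathbf{y}_{T_0} = f_{\text{dec}_1}(\mathbf{z}; \theta_{\text{dec}_1}), \quad \mathbf{y}_{T_1} = f_{\text{dec}_2}(\mathbf{z}; \theta_{\text{dec}_2}),\\
    \text{MTL-Single } (N=1): \quad & [\mathbf{y}_{T_0}, \mathbf{y}_{T_1}] = f_{\text{dec}_1}(\mathbf{z}; \theta_{\text{dec}_1}).
\end{aligned}
\end{equation*}
Under this reading, only $\theta_{\text{enc}}$ is shared between the tasks in the Dual variant, whereas both $\theta_{\text{enc}}$ and $\theta_{\text{dec}_1}$ are shared in the Single variant.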

Aside from the additional decoder in U-Mamba MTL-Dual, both models maintain the same overall structure as the base U-Mamba network. A complete architectural overview is provided in \figureref{fig:U-MambaMTL-arch}.


\begin{figure}[ht]
\includegraphics[width=\textwidth]{figures/u-mamba-arch.png}
\caption{Architectural overview of U-Mamba MTL-Single and U-Mamba MTL-Dual}\label{fig:U-MambaMTL-arch}
\end{figure}





% In order to effectively learn the two separate tasks, a design choice of how much of the networks are shared and how much is seperate needs to be chosen. We would expect 


% A critical choice when selecting an MTL strategy is how much of the network is shared between the tasks. 

% In the first MTL strategy, U-Mamba is extended with an additional decoder, each responsible for predicting one of the two tasks.


% we extended the original U-Mamba architecture using a parallel multi-task learning strategy. As zonal masks (TZ and PZ) are considered an auxiliary task and the difficulty of predicting zonal masks is comparably easier than the PCa prediction, we opted to consider the zonal masks as a single task. We would expect these two classes to work well within a shared decoder, as the two classes are closely related, and by limiting our architecture to two decoders, additional computational complexity is avoided in terms of trainable parameters.





\subsection{Loss Functions}

% The PCa class observes a severe class imbalance (class imbalance here), is not present for all patients, and is generally deemed as a hard task to predict

The two tasks that our U-Mamba MTL architectures aim to predict have very different characteristics, which can cause convergence issues if not handled carefully. The PCa task exhibits a severe class imbalance relative to the background, and PCa is not present in all cases. These properties fit well with the selection criteria for the Focal loss function. The zonal mask prediction task, on the other hand, exhibits only a moderate class imbalance relative to the background, and the zones are present in all cases. Therefore, a combination of Dice and cross-entropy (CE) loss is deemed better suited for this task.
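
For reference, commonly used forms of these three losses are sketched below. The notation is introduced here for illustration only: $V$ is the set of voxels, $C$ the set of classes, $p_{v,c}$ the predicted probability and $g_{v,c} \in \{0,1\}$ the ground-truth label of class $c$ at voxel $v$, and $p_{t,v}$ the probability assigned to the true class of voxel $v$ in the binary PCa task. The focusing parameter $\gamma$, the weighting factor $\alpha$ and the smoothing constant $\epsilon$ are hypothetical placeholders; this sketch does not imply the exact variants or hyperparameter values used in this work.
\begin{align*}
    \mathcal{L}_{\text{Focal}} &= -\frac{1}{|V|} \sum_{v \in V} \alpha \, (1 - p_{t,v})^{\gamma} \log p_{t,v}, \\
    \mathcal{L}_{\text{Dice}} &= 1 - \frac{1}{|C|} \sum_{c \in C} \frac{2 \sum_{v \in V} p_{v,c} \, g_{v,c} + \epsilon}{\sum_{v \in V} p_{v,c} + \sum_{v \in V} g_{v,c} + \epsilon}, \\
    \mathcal{L}_{\text{CE}} &= -\frac{1}{|V|} \sum_{v \in V} \sum_{c \in C} g_{v,c} \log p_{v,c}.
\end{align*}
In this form, the factor $(1 - p_{t,v})^{\gamma}$ down-weights voxels that are already classified with high confidence, which is why the Focal loss is attractive when easy background voxels dominate. For $\gamma = 0$ and $\alpha = 1$ it reduces to the binary cross-entropy.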

Because the two task losses differ in scale and the tasks differ in relative difficulty, a balancing factor $\beta$ is introduced. With the two targets of our MTL variants of U-Mamba defined as $T_0 = \text{PCa}$ and $T_1 = \text{zonal masks}$ (PZ and TZ), the multi-task loss is defined as:


\begin{align}
    \mathcal{L}_{T_0} &= \mathcal{L}_{\text{Focal}}, \quad 
    \mathcal{L}_{T_1} = \lambda \mathcal{L}_{\text{Dice}} + (1 - \lambda) \mathcal{L}_{\text{CE}}, \quad 
    \mathcal{L} = \mathcal{L}_{T_0} + \beta \mathcal{L}_{T_1}.
\end{align}

The weighting parameter $\lambda$ is set to $0.5$, which gives equal weight to the Dice and cross-entropy components of $\mathcal{L}_{T_1}$. The balancing factor $\beta$ is set to $0.2$ to account for both the difference in loss range and the relative difficulty of the tasks. Note that the unaltered U-Mamba network uses $\mathcal{L}_{\text{Focal}}$ as its only loss function.
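
Substituting the stated values $\lambda = 0.5$ and $\beta = 0.2$ into the multi-task loss above makes the effective weight of each term explicit:
\begin{equation*}
    \mathcal{L} = \mathcal{L}_{\text{Focal}} + 0.2 \left( 0.5 \, \mathcal{L}_{\text{Dice}} + 0.5 \, \mathcal{L}_{\text{CE}} \right) = \mathcal{L}_{\text{Focal}} + 0.1 \, \mathcal{L}_{\text{Dice}} + 0.1 \, \mathcal{L}_{\text{CE}}.
\end{equation*}
As a purely hypothetical numerical illustration (these values are not taken from any experiment), $\mathcal{L}_{\text{Focal}} = 0.05$, $\mathcal{L}_{\text{Dice}} = 0.30$ and $\mathcal{L}_{\text{CE}} = 0.20$ would give $\mathcal{L}_{T_1} = 0.25$ and $\mathcal{L} = 0.05 + 0.2 \times 0.25 = 0.10$, so that the down-weighted zonal term and the Focal term contribute equally in this made-up case.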

% As the combination of the two zonal masks is equivalent to the whole prostate, we added a 4th channel (including background) defined as the combination of the two zones in order to guide the network to satisfy the combined prediction as the whole prostate shape.


\subsection{Model Training}

Each model was trained on the PI-CAI challenge training dataset (N=1500) using 5-fold cross-validation. In each fold, approximately 80\% of the cases are used for training and 20\% for validation.
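
As a rough illustration, assuming five folds of equal size defined at the case level (an assumption made here; splits defined at the patient level would yield slightly unequal folds), each fold uses
\begin{equation*}
    N_{\text{val}} = \frac{1500}{5} = 300, \qquad N_{\text{train}} = 1500 - 300 = 1200
\end{equation*}
cases for validation and training, respectively, and every case appears in a validation set exactly once across the five folds.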

All models were trained for 200 epochs per fold, a choice driven by the early convergence observed during development, typically around 100 epochs. In contrast, the baseline models from the PI-CAI challenge organizers were trained for 1000 epochs. Training was conducted on a single A100 GPU (80GB VRAM) using the AdamW optimizer with a cosine annealing learning rate scheduler, producing five sets of weights per model, one per fold. Final predictions for each model were generated as a mean ensemble of the predictions from all five folds.
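
The ensembling and the learning rate schedule can be sketched as follows, using notation introduced here for illustration. Denoting by $f^{(k)}$ the model obtained from fold $k$ with weights $\theta^{(k)}$, and assuming that the averaging is performed on the predicted probability maps, the ensembled prediction for an input $\mathbf{x}$ is
\begin{equation*}
    \hat{\mathbf{p}}(\mathbf{x}) = \frac{1}{5} \sum_{k=1}^{5} f^{(k)}(\mathbf{x}; \theta^{(k)}).
\end{equation*}
In its common form without warm restarts, cosine annealing sets the learning rate at epoch $t$ to
\begin{equation*}
    \eta_t = \eta_{\min} + \frac{1}{2} \left( \eta_{\max} - \eta_{\min} \right) \left( 1 + \cos\!\left( \frac{\pi t}{T} \right) \right), \qquad T = 200,
\end{equation*}
where $\eta_{\max}$ and $\eta_{\min}$ denote the initial and final learning rates. Their values are not specified here, and per-epoch updates without warm-up are assumptions of this sketch.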


% Final predictions are generated by using a mean ensemble of all 5 model predictions.

To enhance the diversity of the training data, a set of random data augmentations was applied to the training data in each epoch. To ensure that all images have the same size, we resample all images to a common spacing and crop or pad them using the prostate as the center. The specific settings for each augmentation are listed in \appendixref{appendix:data-aug}.


\subsection{Baseline Models}

To assess the performance of our models relative to the current state of the art (SOTA), we used the three baseline methods provided by the PI-CAI Challenge organizers: nnUNet \cite{isensee_nnu-net_2021}, nnDetection \cite{baumgartner_nndetection_2021}, and a standard U-Net \cite{ronneberger_u-net_2015}. In addition to the PI-CAI baselines, we trained a SOTA transformer model, Swin UNETR \cite{hatamizadeh_swin_2022}, using the same setup as for U-Mamba and our U-Mamba MTL models, except for the input size in the Z-dimension, which was set to 32 due to model requirements.

% These models were all trained for 5 folds, where the model weights and inference pipeline were used to produce a comparison on the in-house dataset. 

\subsection{Metrics}
\label{sec:metrics}

% We opted for a combination of average precision (AP) and area under the receiver operating characteristic curve (AUC) for assessing the PCa segmentations, as per PI-CAI recommendations. AP and AUC are calculated by considering lesion candidates. Each lesion candidate is extracted from the PCa probability map by iteratively selecting the maximum probability and generating a connected component of the surrounding voxels, where the probability threshold is 40\% of the maximum probability. This process is done until there are no more candidates above the dynamic threshold. The probability of each lesion candidate is set by the maximum value within each candidate. The final set of lesion candidates for one sample is called a detection map \cite{bosma_semisupervised_2023}. 

% In the AP calculation, a lesion is considered true positive if the intersection over union (IoU) is above 10\%.  The AUC is calculated per patient by considering the highest probability lesion candidate. An average between AP and AUC is used as a combined score to give an overall performance metric for both lesion-level detection and patient-level PCa classification. The metrics are calculated using the provided picai\_eval script from the PI-CAI challenge \cite{saha_artificial_2024}. 


We assess PCa segmentation masks using average precision (AP) and area under the receiver operating characteristic curve (AUC), following PI-CAI guidelines \cite{saha_artificial_2024}. To compute the metrics, non-overlapping lesion candidates are extracted from the PCa probability map. The lesion candidates are extracted iteratively by selecting the voxel with the maximum probability and then selecting all connected voxels whose probability is at least 40\% of this peak probability \cite{bosma_semisupervised_2023}. A PCa detection map is defined as the collection of all lesion candidates for a given case, where each lesion candidate has a single probability defined by its maximum probability.

A lesion candidate is considered a true positive in the AP calculation if its intersection over union (IoU) with a ground-truth lesion exceeds 10\%. AUC is computed per patient using the highest probability in the PCa detection map. The combined performance metric averages AP and AUC to evaluate both lesion-level detection and patient-level PCa classification. The metrics are calculated with the picai\_eval script provided by the PI-CAI challenge \cite{saha_artificial_2024}.
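
The procedure described above can be summarised in a simplified form, using notation introduced here for illustration: $p(v)$ is the predicted PCa probability at voxel $v$, $R_k$ is the set of voxels not yet assigned to a candidate before step $k$, and $\operatorname{CC}_{v}(S)$ is the connected component of a voxel set $S$ that contains $v$. At step $k$, the candidate $C_k$ and its probability $s_k$ are
\begin{align*}
    v_k^{*} &= \arg\max_{v \in R_k} p(v), \\
    C_k &= \operatorname{CC}_{v_k^{*}}\!\left( \{ v \in R_k : p(v) \geq 0.4 \, p(v_k^{*}) \} \right), \\
    s_k &= p(v_k^{*}).
\end{align*}
For a ground-truth lesion $G$, the hit criterion, the patient-level probability and the combined score then read
\begin{equation*}
    \operatorname{IoU}(C_k, G) = \frac{|C_k \cap G|}{|C_k \cup G|} > 0.1, \qquad s_{\text{patient}} = \max_k s_k, \qquad \text{Score} = \frac{\text{AP} + \text{AUC}}{2}.
\end{equation*}
This is a sketch of the description in the text rather than a specification of picai\_eval, which may apply additional rules (for example, stopping criteria for the candidate extraction) that are not reflected here.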
