\section{Method}
\cref{fig:overall_arch} shows the overall U-Mamba2-SSL framework, which consists of three training stages. In the first stage, we pre-train the U-Mamba2 \cite{umamba2} model with reconstruction objectives. In the second stage, we combine a supervised loss on the labeled data with an unsupervised consistency regularization loss on the unlabeled data. In the third stage, we add pseudo labeling to the training objective.

U-Mamba2 integrates Mamba2 \cite{mamba2} state space models into the bottleneck of the U-Net architecture to enhance its ability to capture long-range dependencies. Mamba2 improves upon Mamba \cite{mamba} by imposing stronger constraints on the structure of the hidden state space, which yields higher efficiency while remaining competitive with transformer-based alternatives.
We present the details of the three training stages: pre-training, consistency regularization training, and pseudo labeling, in the following subsections.
Note that the final checkpoint of each training stage is used to initialize the model of the subsequent stage.

\begin{figure}[tb]
    \centering
    \includegraphics[width=\linewidth]{fig/semi-overall.drawio.png}
    \caption{
    Overall diagram of the proposed U-Mamba2-SSL framework. 
    (a) The first pre-training stage; (b) The second consistency regularization training stage; (c) The third pseudo labeling training stage.
    }
    \label{fig:overall_arch}
\end{figure}

\paragraph{\textbf{Problem Formulation.}}
Let $\mathcal{D}_l=\{(x_{1}^l, y_{1}), \dots, (x_{n}^l, y_{n})\}$ represent the $n$ labeled samples and $\mathcal{D}_{u}=\{x_1^u, \dots, x_m^u\}$ represent the $m$ unlabeled samples, where $x_i^l \in \mathbb{R}^{H \times W \times D}$ is the $i$-th labeled input image, $y_i \in \mathbb{R}^{C \times H \times W \times D}$ is its corresponding voxel-level label, and $x_i^u$ is the $i$-th unlabeled input image. Here, $C$ is the number of classes while $H, W, D$ are the spatial dimensions. Our goal is to exploit the larger number of unlabeled samples (\ie $m \gg n$) to train a 3D segmentation model.

\subsection{First Stage: Pre-training with Disruptive Autoencoder}
\label{subsec:pretrain}
In the medical image domain, data scarcity, caused by factors such as complex ethical regulations on accessing and publicly releasing datasets, makes model pre-training challenging. Therefore, unlike models for natural-image computer vision tasks, models for medical image applications are often trained from scratch with randomly initialized weights. However, recent works \cite{dae,Tang2021SelfSupervisedPO} have shown that pre-training deep learning models for medical image tasks yields models that extract more meaningful feature representations and improve downstream segmentation performance, particularly when labeled data is too limited to train effectively from scratch.

In the first stage of our proposed SSL framework, we utilize all training data (\ie $\mathcal{D}_l \cup \mathcal{D}_u$) to pre-train U-Mamba2 via the disruptive autoencoder (DAE) \cite{dae} method. The DAE method combines three low-level reconstruction tasks for pre-training, namely denoising, super-resolution, and recovering masked information. 

Denoising is the task of restoring the original input from a noisy version, obtained by adding random Gaussian noise to the original input. To produce a good denoised image, the model must learn to restore local details such as edges and textures. Super-resolution is the task of increasing the resolution of a low-resolution image, which we create artificially by downsampling the original input with linear interpolation. To produce a good upsampled image, the model must recover fine image details using both local and global information. Lastly, we mask random cubical regions of the input image by setting their voxel values to zero. As most of the information in medical images lies in fine local details rather than in global structure, we use a cube size that is small relative to the spatial dimensions of the input to avoid discarding too much local information. The model is trained to recover the masked regions, which encourages it to extract meaningful global context.
After applying the three input disruptions, U-Mamba2 learns to reconstruct the original image from the corrupted input with an L1 loss function. 
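As a compact summary, this objective can be sketched as follows. The notation is introduced here for illustration and is not taken from \cite{dae}: $f_\theta$ denotes U-Mamba2 with a reconstruction output, $\mathcal{T}$ denotes the random composition of the three disruptions (whose order and parameters are left unspecified), and the normalization by the number of voxels is an assumption.
\begin{equation}
    \mathcal{L}_{DAE} = \frac{1}{HWD}\left\| f_\theta\big(\mathcal{T}(x)\big) - x \right\|_1 , \qquad x \in \mathcal{D}_l \cup \mathcal{D}_u \,\,.
\end{equation}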

\subsection{Second Stage: Consistency Regularization Training}
We exploit the smoothness assumption and employ consistency regularization in the second training stage, encouraging the model's predictions to be invariant to perturbations. In this stage, we learn the model parameters with a combination of supervised and unsupervised losses. For a labeled training sample, $x_{i}^l$, with voxel-level class label, $y_{i}$, the model is trained in a supervised fashion with the supervised loss, $\mathcal{L}_{S}$, a combination of Dice loss and cross-entropy loss. An unlabeled training sample, $x_{i}^u$, is first passed through the model to obtain an unperturbed output, $\hat{y}_{i}^u$. Then, we apply input and feature perturbations \cite{cct} and pass the perturbed input through the model to obtain the perturbed output, $\tilde{y}_i^u$. The semi-supervised consistency regularization loss, $\mathcal{L}_{CR}$, is computed as the L1 loss between $\hat{y}_{i}^u$ and $\tilde{y}_i^u$. We describe the perturbations in detail below.
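
One way to write the consistency loss explicitly is sketched below. We assume, for illustration only, that the loss is averaged over all $C \times H \times W \times D$ output entries; whether it is computed on logits or on softmax probabilities, and whether gradients flow through $\hat{y}_{i}^u$, is not fixed by this sketch.
\begin{equation}
    \mathcal{L}_{CR} = \frac{1}{CHWD}\left\| \hat{y}_{i}^u - \tilde{y}_i^u \right\|_1 \,\,.
\end{equation}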

\subsubsection{Input Perturbations.}
We apply strong data augmentation to the unlabeled data to obtain the perturbed input. It is crucial not to apply spatial augmentations (\eg mirroring or rotation) because, in the context of segmentation, these transformations are non-local and violate the smoothness assumption. Specifically, in this stage, we apply median filtering, Gaussian blur, Gaussian noise, random brightness, random contrast, low-resolution simulation, and image sharpening.

\subsubsection{Feature Perturbations.}
The perturbed inputs are passed through the encoder blocks in U-Mamba2 to obtain multi-scale 3D feature maps. Before the encoder feature maps are connected to the decoder blocks via skip connections, we apply random perturbations in the feature space to encourage the model to learn more robust and generalizable feature representations. The feature perturbations consist of dropping activations or injecting noise in the encoder feature maps:
\begin{itemize}
    \item Random Spatial Dropout \cite{spatial_dropout}: We apply random channel-wise dropout with a probability of $0.5$. In contrast to i.i.d. dropout, this promotes channel-wise independence in the encoder feature maps.
    \item Random Activation Dropout \cite{dropout}: Highly activated regions are randomly dropped to force the model to attend to less active regions of the feature map. We randomly sample a threshold, $\gamma_{drop} \sim \mathcal{U}(0.7, 0.9)$, then set all activations above the $\gamma_{drop}$ quantile to zero. As a result, the top $10\%$--$30\%$ most highly activated entries in the feature map are dropped.
    \item Noise Injection: A noise tensor with the same shape as the feature map is first sampled from a uniform distribution, $N \sim \mathcal{U}(-0.3, 0.3)$. As the magnitude of the activations varies across feature maps, we make the noise proportional to the feature map by multiplying the noise tensor with the feature map before adding it, giving $Z + (Z \odot N)$, where $Z \in \mathbb{R}^{F \times H \times W \times D}$ is the feature map, $\odot$ is element-wise multiplication, and $F$ is the number of channels.
\end{itemize}
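The activation dropout step can be made explicit as a masking operation on an encoder feature map $Z$. In the sketch below, $q_{\gamma}(Z)$ denotes the $\gamma_{drop}$-quantile of the activations in $Z$ and $\mathbb{1}[\cdot]$ is the element-wise indicator function. Both symbols, as well as the choice to compute the quantile over the whole feature map rather than per channel, are assumptions made for illustration.
\begin{equation}
    Z' = Z \odot \mathbb{1}\left[ Z \le q_{\gamma}(Z) \right], \qquad \gamma_{drop} \sim \mathcal{U}(0.7, 0.9) \,\,.
\end{equation}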

\subsubsection{Semi-Supervised Learning Schedule.}
In practice, we utilize both labeled and unlabeled data during each training epoch. The overall loss signal from both labeled and unlabeled data is computed as 
\begin{equation}
    \mathcal{L} = \mathcal{L}_{S} + \omega_{CR}\mathcal{L}_{CR} \,\,,
    \label{eq:second_stage_loss}
\end{equation}
where $\omega_{CR}$ is the unsupervised loss weight function. $\omega_{CR}$ ramps up exponentially \cite{temp_ensemble} from zero to a fixed weight, $W_{CR}$, which it reaches at epoch $0.2T_{ep}$, where $T_{ep}$ is the total number of training epochs. Additionally, we linearly increase the proportion of unlabeled data in each epoch from $10\%$ to $50\%$, reaching $50\%$ at epoch $0.4T_{ep}$, so that the model focuses on learning the main segmentation task in the early phase.
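For concreteness, the two schedules can be sketched as functions of the epoch index $t \in [0, T_{ep}]$. The exponential form below is the ramp-up commonly used following \cite{temp_ensemble}; the constant $5$ and the symbol $\rho_u(t)$ for the proportion of unlabeled data are assumptions and are not specified in this section. Note that this form starts at $W_{CR}e^{-5} \approx 0.007\,W_{CR}$ rather than at exactly zero.
\begin{equation}
    \omega_{CR}(t) = W_{CR}\exp\left(-5\left(1 - \min\left(1, \frac{t}{0.2T_{ep}}\right)\right)^{2}\right) ,
\end{equation}
\begin{equation}
    \rho_u(t) = 0.1 + 0.4\,\min\left(1, \frac{t}{0.4T_{ep}}\right) \,\,.
\end{equation}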

\subsection{Third Stage: Pseudo Labeling}
After the second training stage, we obtain a U-Mamba2 segmentation model whose predictions are locally smooth. We capitalize on this property by further training the model with the pseudo labeling \cite{pseudolabel} strategy. Specifically, the model's predictions on unlabeled samples are treated as pseudo labels and used to train the model in a supervised manner. For each voxel, if the confidence of the predicted class is above a given threshold, $\lambda_{conf}$, we use the predicted class as ground truth; otherwise, the voxel is set to the background class and is ignored in the loss calculation.
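
The voxel-wise selection rule can be sketched as follows, where $p_{v,c}$ denotes the predicted probability of class $c$ at voxel $v$ of an unlabeled sample. The symbols $p_{v,c}$, $\hat{c}_v$, and $m_v$ are introduced here for illustration only.
\begin{equation}
    \hat{c}_v = \arg\max_{c}\, p_{v,c}, \qquad m_v = \mathbb{1}\left[ \max_{c}\, p_{v,c} > \lambda_{conf} \right] \,\,.
\end{equation}
Voxels with $m_v = 1$ keep $\hat{c}_v$ as their pseudo label, while voxels with $m_v = 0$ are assigned to the background class and excluded from the loss.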

In this stage, the loss function from \cref{eq:second_stage_loss} becomes:
\begin{equation}
    \label{eq:third_stage_loss}
    \mathcal{L} = \mathcal{L}_{S} + \omega_{CR}\mathcal{L}_{CR} + W_{PL}\mathcal{L}_{PL} \,\,,
\end{equation}
where $\mathcal{L}_{PL}$ is the supervised loss computed with the pseudo labels, ignoring the background class, and $W_{PL}$ is the weight of $\mathcal{L}_{PL}$ used to balance the loss terms.
Similar to the second stage, we linearly increase the proportion of unlabeled samples in each training epoch from $30\%$ to $50\%$, reaching $50\%$ at epoch $0.2T_{ep}$.
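Written as a function of the epoch index $t \in [0, T_{ep}]$, this third-stage schedule can be sketched as below, where the symbol $\rho_u(t)$ for the proportion of unlabeled samples is an assumption introduced for illustration.
\begin{equation}
    \rho_u(t) = 0.3 + 0.2\,\min\left(1, \frac{t}{0.2T_{ep}}\right) \,\,.
\end{equation}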
