\section{Experiments}
\subsection{Dataset and Evaluation Metrics}
Segmentation quality is evaluated with the Dice Similarity Coefficient (DSC), Normalized Surface Distance (NSD), Mean Intersection over Union (mIoU), and Identification Accuracy (IA), which together measure region overlap and boundary distance. Algorithm runtime and memory consumption are also evaluated and ranked.
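
For reference, the two overlap metrics follow their standard definitions. The symbols $P$ and $G$ are introduced here only for illustration and denote the predicted and ground-truth voxel sets of a single class:
\[
    \mathrm{DSC}(P, G) = \frac{2\,|P \cap G|}{|P| + |G|},
    \qquad
    \mathrm{IoU}(P, G) = \frac{|P \cap G|}{|P \cup G|}.
\]
For a single mask pair the two are related by $\mathrm{DSC} = 2\,\mathrm{IoU} / (1 + \mathrm{IoU})$, so DSC is never smaller than IoU. How the challenge aggregates these values over classes and images is not restated here.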

\subsection{Implementation Details}
\paragraph{\textbf{Preprocessing.}}
We resample all inputs to the median voxel spacing of the training data, $(0.3, 0.25, 0.25)$, which yields a median input size of $(337, 640, 640)$ voxels. We then clip the intensities to the 0.5th and 99.5th percentiles and normalize them using the mean and standard deviation of the voxel values.
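
The preprocessing steps above can be sketched as follows. This is a minimal illustration rather than the pipeline actually used: the function names are hypothetical, and we assume that the percentiles, mean, and standard deviation are computed per volume, which the text does not specify.

```python
import numpy as np


def resampled_shape(shape, spacing, target_spacing):
    """Grid size after resampling to target_spacing with the same physical extent."""
    return tuple(
        int(round(n * s / t)) for n, s, t in zip(shape, spacing, target_spacing)
    )


def clip_and_normalize(volume, lower_pct=0.5, upper_pct=99.5):
    """Clip intensities to the given percentiles, then z-score normalize."""
    volume = volume.astype(np.float32)
    # Assumption: percentiles are taken over all voxels of this one volume.
    lo, hi = np.percentile(volume, [lower_pct, upper_pct])
    clipped = np.clip(volume, lo, hi)
    # Assumption: mean and standard deviation are computed after clipping.
    mean, std = clipped.mean(), clipped.std()
    return (clipped - mean) / max(std, 1e-8)
```

The actual intensity statistics may instead be collected over the whole training set, in which case \texttt{lo}, \texttt{hi}, \texttt{mean}, and \texttt{std} would be passed in as precomputed constants.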

\paragraph{\textbf{Environment settings.}}
The development environments and requirements are presented in Table~\ref{table:env}.

\begin{table}[tb]
    \caption{Development environments and requirements.}
    \label{table:env}
    \centering\setlength{\tabcolsep}{4pt}
    \resizebox{0.66\linewidth}{!}{
    \begin{tabular}{ll}
        \hline
        System & Ubuntu 24.04 \\
        \hline
        CPU & Intel(R) Core(TM) Ultra 9 285K \\
        \hline
        RAM & 2 $\times$ 32 GB; 6400 MHz\\
        \hline
        GPU & NVIDIA RTX 5090 32 GB \\
        \hline
        CUDA version & 12.9 \\ 
        \hline
        Programming language & Python 3.11 \\ 
        \hline
        Deep learning framework & PyTorch 2.7.1, nnU-Net 2.6.2 \\
        \hline
    \end{tabular}
    }
\end{table}

\paragraph{\textbf{Training protocols.}}
We implement U-Mamba2-SSL with the nnU-Net \cite{nnunet} framework, using patch-based training and a sliding window inference strategy. During training, we randomly apply rotation, scaling, Gaussian noise, Gaussian blur, brightness and contrast transforms, low-resolution simulation, and mirroring as data augmentation. We randomly crop input patches so that at least 33\% of the voxels contain a foreground label. $W_{CR}$, $W_{PL}$, and $\lambda_{conf}$ are set to $50$, $0.1$, and $0.75$, respectively.
All models have 7 encoder-decoder stages and follow the model configuration in \Cref{table:training}.
The 30 provided labeled training samples are split into 20 training samples and 10 internal validation samples, where the internal validation set is used to monitor training progress and for offline evaluation. We select the checkpoint with the highest DSC on our internal validation set and report the performance metrics on the hidden validation set.


\begin{table}[tb]
    \caption{Training configuration.}
    \label{table:training}
    \centering\setlength{\tabcolsep}{4pt}
    \resizebox{0.66\textwidth}{!}{
    \begin{tabular}{ll} 
        \hline
        Pre-trained Model & See \Cref{subsec:pretrain} \\
        \hline
        Batch size & 2 \\
        \hline 
        Patch size & $128 \times 256 \times 256$  \\ 
        \hline
        Total epochs & 500 \\
        \hline
        Optimizer & SGD with $0.99$ momentum  \\
        \hline
        Initial learning rate  & 0.01 \\ 
        \hline
        LR decay schedule & Polynomial LR decay \\
        \hline
        Training time & 13 hours \\ 
        \hline 
        Loss function & See \Cref{eq:second_stage_loss,eq:third_stage_loss} \\
        \hline
        Number of model parameters & 156M \\ 
        \hline
        Number of FLOPs & 6.22T\\ 
        \hline
    \end{tabular}
    }
\end{table}
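
As a sketch of the polynomial schedule listed in \Cref{table:training}, the learning rate at epoch $t$ out of $T$ total epochs can be written as below. The exponent $0.9$ is an assumption: it is the default of the polynomial scheduler in nnU-Net and is not stated in the configuration above.
\[
    \eta_t = \eta_0 \left(1 - \frac{t}{T}\right)^{0.9}
\]
With $\eta_0 = 0.01$ and $T = 500$, this gives $\eta_{250} = 0.01 \times 0.5^{0.9} \approx 0.0054$ at the halfway point, and the rate decays to zero as $t$ approaches $T$.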


\section{Results and Discussion}
\subsection{Quantitative Results}
\Cref{tab:results} presents the results of our proposed method compared with two baselines, nnU-Net and U-Mamba2. All methods achieve high DSC, NSD, and mIoU, which measure overall image-level performance. However, U-Mamba2-SSL outperforms the baselines by a large margin in IA, which is the average percentage of classes with IoU $> 0.5$ across all images. The bottom three rows of \Cref{tab:results} report the ablation study of our proposed method. Notably, pre-training leads to the largest leap in IA, from 0.464 to 0.731, while incorporating consistency regularization and pseudo labeling further increases IA to 0.738.
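
To make the IA definition concrete, the following sketch computes it from integer label maps. It is an illustration under stated assumptions, not the official challenge evaluation code: the function names are hypothetical, label $0$ is assumed to be background, and only classes present in the ground truth of an image are assumed to count toward that image's score.

```python
import numpy as np


def class_iou(pred, gt, class_id):
    """IoU between the predicted and ground-truth masks of one class."""
    p, g = pred == class_id, gt == class_id
    union = np.logical_or(p, g).sum()
    if union == 0:
        return float("nan")
    return np.logical_and(p, g).sum() / union


def identification_accuracy(preds, gts, iou_threshold=0.5, background=0):
    """Mean over images of the fraction of classes with IoU above the threshold."""
    fractions = []
    for pred, gt in zip(preds, gts):
        # Assumption: only classes present in the ground truth are scored.
        classes = [c for c in np.unique(gt) if c != background]
        if not classes:
            continue  # skip images without any foreground class
        hits = [class_iou(pred, gt, c) > iou_threshold for c in classes]
        fractions.append(np.mean(hits))
    return float(np.mean(fractions)) if fractions else float("nan")
```

Because each class contributes a binary hit or miss, a model can reach a high mIoU while missing many individual classes, which is consistent with the gap between mIoU and IA in \Cref{tab:results}.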

\begin{table}[tb]
    \caption{Evaluation results on the validation set. CR denotes consistency regularization; PL denotes pseudo labeling. Our ablation study is reported in the bottom three rows, with the last row corresponding to the final U-Mamba2-SSL.
    }\label{tab:results}
    \centering\setlength{\tabcolsep}{4pt}
    \resizebox{0.85\linewidth}{!}{
    \begin{tabular}{l|ccc|cccc|c}
        \hline
        Methods & Pre-train & CR & PL &   DSC   &   NSD   &   mIoU    &   IA  & Average\\ 
        \hline
        nnU-Net \cite{nnunet} & - & - & - & 0.963 & 0.997 & 0.928 & 0.286 & 0.794 \\
        U-Mamba2 \cite{umamba2} & - & - & - & 0.965 & 0.998 & 0.930 & 0.464 & 0.839 \\
        \hline
        \multirow{3}{*}{U-Mamba2-SSL} & \checkmark & \crossmark & \crossmark & 0.967 & 0.998 & 0.937 & 0.731 & 0.908 \\
        & \checkmark & \checkmark & \crossmark & 0.967 & 0.999 & 0.935 & 0.736 & \textbf{0.910} \\
        & \checkmark & \checkmark & \checkmark & 0.967 & 0.999 & 0.935 & 0.738 & \textbf{0.910} \\
        \hline
    \end{tabular}
    }
\end{table}

\subsection{Qualitative Results}
\begin{figure}[tb]
    \begin{subfigure}{0.49\textwidth}
        \centering 
        \includegraphics[width=0.54\linewidth,trim={7cm 1cm 0 3cm},clip]{fig/best_3d_gt.png}
        \includegraphics[width=0.4\linewidth]{fig/best_2d_gt.png}
    \end{subfigure} \hfill
    \begin{subfigure}{0.49\textwidth}
        \centering 
        \includegraphics[width=0.54\linewidth,trim={7cm 1cm 0 3cm},clip]{fig/best_3d_pred.png}
        \includegraphics[width=0.4\linewidth]{fig/best_2d_pred.png}
    \end{subfigure}
    
    \begin{subfigure}{0.49\textwidth}
        \centering 
        \includegraphics[width=0.54\linewidth,trim={0 0 0 2cm},clip]{fig/worst_3d_gt.png}
        \includegraphics[width=0.4\linewidth]{fig/worst_2d_gt.png}
        \caption{Ground Truth}
    \end{subfigure} \hfill
    \begin{subfigure}{0.49\textwidth}
        \centering 
        \includegraphics[width=0.54\linewidth,trim={0 0 0 2cm},clip]{fig/worst_3d_pred.png}
        \includegraphics[width=0.4\linewidth]{fig/worst_2d_pred.png}
        \caption{Prediction}
    \end{subfigure}
    \caption{Qualitative results of U-Mamba2-SSL on the internal validation set. The 3D render and a representative 2D slice are shown for: (Top) the best scoring case and (Bottom) the worst scoring case.}
    \label{fig:qualitative}
\end{figure}

\Cref{fig:qualitative} compares the ground truth with our model's predictions for the scans with the highest and lowest DSC in our internal validation set, shown in the top and bottom rows, respectively. In general, our method accurately differentiates between the tooth and the different classes of pulp and root canal. Its failure cases typically stem from an inability to precisely predict the thickness and the length or extent of the pulp. Moreover, our model struggles with limited field-of-view (LFOV) CBCTs, where it predicts more false positives around the image edges.

\subsection{Final Challenge Submission}
We scale up our training procedure by training on all available data for 1000 epochs and increasing the input patch size to $160 \times 256 \times 256$. For inference, we use sliding window inference with a tile step size of 0.9 and enable mirroring along the anterior/posterior and left/right axes during test-time augmentation (see \Cref{appx:speed} for the speed optimization).
Our method achieved a 0.969 DSC, 0.998 NSD, 0.940 mIoU, and 0.806 IA on the validation set, while obtaining a DSC, NSD, mIoU, and IA of 0.917, 0.882, 0.948, and 0.577, respectively, on the final hidden test set, securing first place in Task 1 of the STSR 2025 challenge.
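
The inference settings can be illustrated with the sketch below. It assumes that the value 0.9 is the step between adjacent windows expressed as a fraction of the patch size, and that the model output has the same spatial layout as its input. The function names, the \texttt{predict} callable, and the axis indices are hypothetical, since the mapping from anatomical directions to array axes is not given here.

```python
import itertools
import math

import numpy as np


def window_starts(image_len, patch_len, step_fraction):
    """Start indices of evenly spaced windows that fully cover one axis."""
    if image_len <= patch_len:
        return [0]
    target_step = patch_len * step_fraction
    num = math.ceil((image_len - patch_len) / target_step) + 1
    # Spread the windows evenly so that the last one ends at the image border.
    actual_step = (image_len - patch_len) / (num - 1)
    return [int(round(actual_step * i)) for i in range(num)]


def mirror_tta(predict, volume, axes):
    """Average predictions over every combination of flips along the given axes."""
    total, count = None, 0
    for r in range(len(axes) + 1):
        for combo in itertools.combinations(axes, r):
            flipped = np.flip(volume, axis=combo) if combo else volume
            pred = predict(flipped)
            # Undo the flip so that all predictions are aligned before averaging.
            pred = np.flip(pred, axis=combo) if combo else pred
            total = pred if total is None else total + pred
            count += 1
    return total / count
```

Under these assumptions, a 640-voxel axis with a 256-voxel patch needs three windows at a step fraction of 0.9, compared with four at a step fraction of 0.5, while mirroring along two axes requires four forward passes per window instead of eight for all three axes.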

\subsection{Limitations and Future Work}
Our work has several limitations. First, the dataset mixes full and LFOV CBCTs, which differ in content and image properties. Second, IA drops substantially on the final hidden test set (0.577, compared with 0.806 on the validation set), suggesting possible overfitting or domain shift. Future work should design data processing and augmentation techniques tailored to the different types of CBCTs to leverage their differences and improve model generalizability. Lastly, only a small region of interest (ROI) in each CBCT image contains the foreground classes; future research can exploit this to avoid wasting computation on non-foreground regions, allowing the model to focus on the true ROI.
