\section{Experimental Results}
We implement U-Mamba2 within the nnU-Net framework \cite{nnunet}. We perform a 9:1 stratified train-validation split of the ToothFairy3 dataset so that the training and validation sets contain the same proportion of each data source (which differ in field of view and imaging machine). All models are pre-trained with SSL (\Cref{subsec:pretrain}) following the original training configuration of DAE \cite{dae}. Each model employs seven encoder-decoder stages, an input patch size of $128\times256\times256$, the native voxel spacing of $0.3\times0.3\times0.3$\,mm\textsuperscript{3} (so no resampling is needed during training or inference), and a batch size of 1. During training, we disable left/right mirroring augmentation for all models except U-Mamba2. During inference, we use sliding window inference with a tile size of 0.5 and, for a fair comparison, disable left/right mirroring in test-time augmentation (TTA) for all models, including U-Mamba2. Other hyperparameters follow the nnU-Net defaults.

Model training and inference-time measurements are performed on an NVIDIA RTX 4090 GPU. We evaluate the models with the Dice coefficient, the Hausdorff Distance at the 95th percentile (HD95), and the average inference time in seconds. Lower is better for all metrics except Dice.
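
For reference, the two accuracy metrics can be written as follows. This is a sketch of the commonly used definitions rather than a description of the exact evaluation code, and the symbols $P$, $G$, $S_P$, $S_G$, and $d$ are introduced here only for illustration. For a predicted voxel set $P$ and a ground-truth voxel set $G$ of one class,
\begin{equation}
    \mathrm{Dice}(P, G) = \frac{2\,|P \cap G|}{|P| + |G|}.
\end{equation}
Let $S_P$ and $S_G$ denote the surface points of $P$ and $G$, and let $d(a, S) = \min_{b \in S} \|a - b\|_2$. Then
\begin{equation}
    \mathrm{HD95}(P, G) = \max\Bigl\{ \mathrm{perc}_{95}\{ d(a, S_G) : a \in S_P \},\; \mathrm{perc}_{95}\{ d(b, S_P) : b \in S_G \} \Bigr\},
\end{equation}
where $\mathrm{perc}_{95}$ denotes the 95th percentile. Some implementations instead take the percentile over the union of both distance sets, so absolute HD95 values may differ slightly between toolkits.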

\subsection{Quantitative Results}
\begin{table}[tb]
    \caption{Evaluation metrics on the validation set. $\dag$ denotes results after post-processing. Time is the average inference time per scan in seconds. The best results are in bold.}
    \label{tab:quantitative_results}
    \centering \setlength{\tabcolsep}{4pt}%
    \resizebox{\linewidth}{!}{
    \begin{tabular}{l|ccccc|ccccc}
         \hline
         \multirow{2}{*}{Model} & \multicolumn{5}{c|}{Task 1} & \multicolumn{5}{c}{Task 2} \\
         \cline{2-11}
          & Dice & HD95 & Dice$\dag$ & HD95$\dag$ & Time & Dice & HD95 & Dice$\dag$ & HD95$\dag$ & Time \\
         \hline
         SwinUNETR \cite{swinunetr} & 0.858 & 48.86 & 0.874 & 40.09 & 7.23 & - & - & - & - & -\\
         nnU-Net ResE \cite{nnunet} & 0.861 & 45.28 & 0.887 & 32.05 & \textbf{6.20} & 0.901 & 1.98 & 0.905 & 1.71 & \textbf{5.06} \\
         U-Mamba \cite{umamba} & 0.865 & 42.06 & 0.896 & 25.88 & 6.98 & 0.903 & 1.65 & \textbf{0.913} & 1.58 & 5.88 \\
         U-Mamba2 (ours) & \textbf{0.873} & \textbf{41.08} & \textbf{0.908} & \textbf{21.35} & 6.81 & \textbf{0.905} & \textbf{1.63} & \textbf{0.913} & \textbf{1.57} & 5.70 \\
         \hline
    \end{tabular}
    }
\end{table}

\Cref{tab:quantitative_results} compares the proposed U-Mamba2 with nnU-Net ResE \cite{nnunet}, U-Mamba \cite{umamba}, which uses the original Mamba layer \cite{mamba}, and SwinUNETR \cite{swinunetr} on the ToothFairy3 dataset. For Task 2, we incorporate a point prompt encoder into nnU-Net ResE and U-Mamba at the bottleneck stage, as in U-Mamba2. Without post-processing, U-Mamba2 achieves the best Dice and HD95 on both tasks, with mean Dice scores of 0.873 and 0.905 for Tasks 1 and 2, respectively. Post-processing further improves these to 0.908 and 0.913, the latter matching U-Mamba. U-Mamba2 requires an average of 6.81 and 5.70 seconds per scan for the two tasks, which is slightly faster than U-Mamba (6.98 and 5.88 seconds) but slower than nnU-Net ResE (6.20 and 5.06 seconds).

\begin{table}[tb]
    \caption{Ablation study of U-Mamba2 on the validation set of Task 1. ILN denotes the mean metrics over the left and right incisive nerves and the lingual nerve.}
    \label{tab:umamba2_abl}
    \centering \setlength{\tabcolsep}{5pt}%
    \resizebox{0.85\linewidth}{!}{
    \begin{tabular}{ccc|cccc}
         \hline
         \makecell{Label\\ Smoothing} & \makecell{Weighted\\ Loss} & \makecell{L/R\\ Mirroring} & Dice & HD95 & Dice (ILN) & HD95 (ILN) \\
         \hline
         \crossmark & \crossmark & \crossmark & 0.867 & 42.36 & 0.617 & 38.41 \\
         \checkmark & \crossmark & \crossmark & 0.872 & \textbf{40.74} & 0.628 & 38.15 \\
         \crossmark & \checkmark & \crossmark & 0.870 & 41.31 & 0.635 & 37.99 \\
         \crossmark & \crossmark & \checkmark & 0.871 & 41.20 & 0.642 & 36.48 \\
         \checkmark & \checkmark & \checkmark & \textbf{0.873} & 41.08 & \textbf{0.646} & \textbf{35.21} \\
         \hline
    \end{tabular}
    }
\end{table}

Furthermore, we perform an ablation study on U-Mamba2 by individually applying the dental domain knowledge introduced in \Cref{subsec:domain_know}, excluding the post-processing step (see \Cref{tab:quantitative_results} for post-processing results). \Cref{tab:umamba2_abl} shows that each technique yields a small improvement over the baseline, raising the mean Dice score from 0.867 to between 0.870 and 0.872. In particular, the weighted loss and left/right mirroring improve the mean Dice score on the three tiny structures, \ie the left and right incisive nerves and the lingual nerve (ILN), from 0.617 to 0.635 and 0.642, respectively. When all three techniques are applied, U-Mamba2 achieves the best Dice scores, 0.873 and 0.646 for all classes and the ILN classes, respectively, as well as the best ILN HD95, although label smoothing alone gives a slightly lower overall HD95 (40.74 vs.\ 41.08).

\subsection{Qualitative Results}
\begin{figure}[tb]
    \begin{subfigure}{0.48\textwidth}
        \centering 
        \includegraphics[width=0.53\linewidth]{fig/best_3d_gt.png}
        \includegraphics[width=0.45\linewidth]{fig/best_2d_gt.png}
    \end{subfigure} \hfill
    \begin{subfigure}{0.48\textwidth}
        \centering 
        \includegraphics[width=0.53\linewidth]{fig/best_3d_pred.png}
        \includegraphics[width=0.45\linewidth]{fig/best_2d_pred.png}
    \end{subfigure}
    
    \begin{subfigure}{0.48\textwidth}
        \centering 
        \includegraphics[width=0.53\linewidth]{fig/worst_3d_gt.png}
        \includegraphics[width=0.45\linewidth]{fig/worst_2d_gt.png}
        \caption{Ground Truth}
    \end{subfigure} \hfill
    \begin{subfigure}{0.48\textwidth}
        \centering 
        \includegraphics[width=0.53\linewidth]{fig/worst_3d_pred.png}
        \includegraphics[width=0.45\linewidth]{fig/worst_2d_pred.png}
        \caption{Prediction}
    \end{subfigure}
    \caption{Qualitative results of U-Mamba2 on the validation set of Task 1. The 3D render and a representative 2D slice are shown for: (Top) the best scoring case and (Bottom) the worst scoring case.}
    \label{fig:qualitative}
\end{figure}

\Cref{fig:qualitative} compares the ground truth with our model's predictions for the validation scans with the highest and lowest Dice scores, shown in the top and bottom rows, respectively. In most cases, U-Mamba2 produces precise segmentations, which is consistent with the benefit of incorporating dental domain knowledge into the model design. U-Mamba2 also accurately localizes the three tiny structures (ILN), producing visually acceptable segmentations. In the worst case, the scan is degraded by image artifacts caused by metallic objects, yet the errors are largely limited to false positives near the image edge and confusion between natural teeth and crowns or implants, suggesting that U-Mamba2 remains robust under noisy conditions.

\subsection{Optimizing Speed in Sliding Window Inference}
\begin{figure}[tb]
    \begin{subfigure}{0.48\textwidth}
        \centering 
        \includegraphics[width=\linewidth]{fig/ts_abl.png}
    \end{subfigure} \hfill
    \begin{subfigure}{0.43\textwidth}
        \centering 
        \includegraphics[width=\linewidth]{fig/mirror_abl.png}
    \end{subfigure}
    \caption{(Left): Effect of the tile size on the metrics with `0,1' mirror axes in TTA. (Right): Effect of various mirror axes combinations in TTA on the metrics when tile size is set to 0.9. Axis definition: `0' is superior/inferior, `1' is anterior/posterior, and `2' is left/right.}
    \label{fig:speed_tradeoff}
\end{figure}
As inference time is an important metric in the ToothFairy3 challenge, we tune the sliding window inference parameters to improve speed without significantly degrading accuracy. Specifically, we tune the tile size, where a larger value results in less overlap between adjacent windows, and the combination of mirror axes used in TTA. \Cref{fig:speed_tradeoff} shows the trade-off between Dice score and inference time for different tile sizes and mirror-axis combinations. Setting the tile size to 0.9 reduces the inference time by 12.9\% at the cost of a drop of only 0.002 in Dice score. Moreover, \cref{fig:speed_tradeoff} shows that the best mirror-axis combination is `1,2' (anterior/posterior and left/right), which offers the highest Dice score with an average inference time of only 6.02 seconds. We hypothesize that this is because the spatial extent is larger along these axes, so they contain more information.
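
The effect of both parameters on the computational cost can be estimated with a simple count. The following is a rough sketch that assumes the default nnU-Net window placement; the symbols $D$, $L$, $t$, $n$, and $m$ are introduced here for illustration only. Along an axis with image length $D$ and patch length $L \le D$, a tile size of $t \in (0, 1]$ gives a step of $tL$ voxels and
\begin{equation}
    n = \left\lceil \frac{D - L}{tL} \right\rceil + 1
\end{equation}
window positions. The number of windows per scan is the product of $n$ over the three axes, and mirroring along $m$ axes in TTA requires $2^m$ forward passes per window. For example, for a hypothetical axis of length $D = 320$ with $L = 128$, a tile size of 0.5 gives $n = \lceil 192/64 \rceil + 1 = 4$ positions, whereas 0.9 gives $n = \lceil 192/115.2 \rceil + 1 = 3$; likewise, two mirror axes require 4 forward passes per window, compared with 8 for all three axes.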

\subsection{Final Challenge Submission}
For the final submission, we extend training to 1500 epochs using all available data, with a batch size of 2 and a larger input patch size of $160\times288\times288$. During inference, we use sliding window inference with a tile size of 0.9 and enable mirroring along the anterior/posterior and left/right axes in TTA.
On the hidden test set, the final U-Mamba2 model achieved a mean Dice of 0.84, an HD95 of 38.17, and an average inference time of 40.58\,s (computed on the Grand Challenge platform using a T4 GPU) in Task 1, securing first place in Task 1 of the ToothFairy3 challenge with an overall rank of 3.1. It also obtained first place in Task 2, with a mean Dice, HD95, and overall rank of 0.87, 2.15, and 1.66, respectively.
