\section{Method}
This section describes our method, designed for the two tasks of the ToothFairy3 challenge \cite{toothfairy3,toothfairy3-ia,toothfairy3-tmi}. Task 1 extends the previous ToothFairy2 challenge by adding segmentation of pulps, incisive nerves, and the lingual foramen to the existing 42 anatomy classes (\eg jaws, sinuses, and 32 teeth), and adds inference time to the evaluation criteria. The dataset contains 532 CBCT scans with shapes ranging from $(170, 272, 345)$ to $(298, 512, 512)$, along with segmentation labels for 46 anatomy classes. Task 2 focuses on interactive segmentation of the inferior alveolar nerves, where user clicks serve as prompts to guide the segmentation. \cref{fig:overall_arch} shows the overall structure of the U-Mamba2 model and the details of the U-Mamba2 block.

\begin{figure}[tb]
    \centering
    \begin{subfigure}{0.66\textwidth}
        \centering \includegraphics[width=\linewidth]{fig/overall-diagram.drawio.png}
    \end{subfigure} \hfill
    \begin{subfigure}{0.33\textwidth}
        \centering \includegraphics[width=\linewidth]{fig/umamba2.drawio.png}
    \end{subfigure}
    \caption{(Left): Overall architecture of the U-Mamba2 model. U-Mamba2 employs the encoder-decoder framework with residual connections between each stage and the U-Mamba2 block in the bottleneck. The number of stages is configurable depending on the dataset input size. (Right): The U-Mamba2 block contains the SSD-based Mamba2 and an optional click position encoder and cross-attention blocks. The output of Mamba2 follows the solid line for tasks without interactive clicks, while it follows the dashed line when clicks are present.}
    \label{fig:overall_arch}
\end{figure}

\subsection{U-Mamba2: Integrating Mamba2 to U-Net}
Inspired by U-Mamba \cite{umamba}, we propose U-Mamba2, which integrates the strengths of U-Net and Mamba2 to efficiently capture global information. As shown in \cref{fig:overall_arch}, U-Mamba2 follows a structure similar to U-Net, with a symmetric encoder-decoder architecture that extracts image features across multiple scales. Residual connections between the encoder and decoder blocks at each stage facilitate the fusion of low-level and high-level features. As convolutional operations are inherently local, we use Mamba2 to compensate for the vanilla U-Net's limited ability to model global long-range dependencies in images by treating the features as long sequences. Like Mamba, Mamba2 scales linearly with sequence length, but it uses the structured state space duality (SSD) framework to constrain the internal recurrent structure and replaces the selective scan with matrix multiplications, thereby improving efficiency through parallelism.

The encoder blocks consist of two consecutive Residual blocks \cite{resnet}, followed by a strided downsampling convolution, while the decoder blocks are composed of Residual blocks and transposed convolutions for upsampling. In the U-Mamba2 block, image features of shape $(B,C,H,W,D)$ are reshaped and transposed to $(B,T,C)$, where $B$ denotes the batch size, $C$ the number of channels, and $H,W,D$ the spatial dimensions, with $T=H\times W\times D$. Layer Normalization \cite{layernorm} is then applied to the features before they are passed to Mamba2 to capture global context. The output features are reshaped and transposed back to $(B,C,H,W,D)$. We apply the U-Mamba2 block exclusively in the bottleneck stage, as this yields the best empirical performance on 3D computed tomography data, consistent with the findings of Ma \etal~\cite{umamba}. Finally, Softmax is applied to the final decoder features to produce the segmentation class probabilities, and U-Mamba2 is trained with a combination of cross-entropy loss and Dice loss.
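
As a compact restatement of the bottleneck described above, the U-Mamba2 block can be sketched as follows. This is an illustrative summary rather than a full specification: the symbols $X$, $\tilde{X}$, $\tilde{Y}$, and $Y$ are introduced here only for exposition, and no residual connection around Mamba2 is shown because none is specified in the text.
\[
\begin{array}{rcll}
\tilde{X} &=& \mathrm{reshape}(X), & (B,C,H,W,D) \rightarrow (B,T,C), \quad T = H \times W \times D, \\
\tilde{Y} &=& \mathrm{Mamba2}(\mathrm{LN}(\tilde{X})), & (B,T,C) \rightarrow (B,T,C), \\
Y &=& \mathrm{reshape}^{-1}(\tilde{Y}), & (B,T,C) \rightarrow (B,C,H,W,D).
\end{array}
\]
For a hypothetical bottleneck feature map with $H=W=D=8$, the sequence length is $T = 8 \times 8 \times 8 = 512$. Assuming, purely for illustration, an unweighted sum of the two loss terms, the training objective reads $\mathcal{L}_{\mathrm{seg}} = \mathcal{L}_{\mathrm{CE}} + \mathcal{L}_{\mathrm{Dice}}$.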

\subsection{Cross-Attention with Point Encoder}
We introduce an optional interactive branch that enables the model to incorporate user-provided clicks to refine the output of U-Mamba2, improve accuracy, and support human-in-the-loop collaboration. Following the SAM2 framework \cite{sam2}, this branch employs a position embedding and two cross-attention blocks, as illustrated in \cref{fig:overall_arch}. The optional click data contain a variable number $N$ of clicks, each consisting of X, Y, Z coordinates and a class label. The clicks are first encoded with a learnable position embedding that depends on their spatial positions and class labels. Next, the embedded click prompts and the output features of Mamba2 are fused through two-way cross-attention blocks, acting as queries and keys, respectively. The cross-attention blocks, each followed by Layer Normalization, are repeated twice to allow the model to integrate click information with the image features. The final output of the cross-attention blocks is then reshaped and transposed back to the original spatial dimensions.
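
The fusion step above can be sketched with standard scaled dot-product attention. The notation is introduced here for illustration only: $E$ of shape $(B,N,C)$ denotes the embedded clicks, $\tilde{Y}$ of shape $(B,T,C)$ the Mamba2 output, and the learned query, key, and value projections are omitted for brevity. The residual additions and the order of the two directions are assumptions modelled on common two-way attention designs, not details confirmed by the text.
\[
\mathrm{Attn}(Q,K,V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{C}}\right) V,
\]
\[
E' = \mathrm{LN}\big(E + \mathrm{Attn}(E, \tilde{Y}, \tilde{Y})\big), \qquad
\tilde{Y}' = \mathrm{LN}\big(\tilde{Y} + \mathrm{Attn}(\tilde{Y}, E', E')\big).
\]
In the first step the clicks act as queries over the image tokens, and in the second step the image tokens attend back to the updated click embeddings, so $\tilde{Y}'$ keeps the shape $(B,T,C)$ regardless of $N$.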

\subsection{Pre-training with Self-Supervised Learning}
\label{subsec:pretrain}
Recent studies \cite{dae,Tang2021SelfSupervisedPO} have shown that pre-training on large datasets with self-supervised learning (SSL) produces stronger models that extract meaningful feature representations, leading to improved performance on downstream segmentation tasks, particularly when labeled data is limited.

In addition to the 532 scans of ToothFairy3, we utilize the STS-3D-Tooth \cite{sdtooth} dataset, consisting of 371 unlabeled CBCT scans, to pre-train U-Mamba2 with the disruptive autoencoder (DAE) \cite{dae} framework. DAE aims to reconstruct the original 3D volume after it has been corrupted by several low-level perturbations. Specifically, we corrupt the input volume by randomly applying local masks, downsampling, and adding Gaussian noise. The disrupted input is then passed through U-Mamba2, which learns to reconstruct the original image under an L1 loss. The pre-trained weights are then used to initialize U-Mamba2 (except for the weights of the optional interactive branch and the final segmentation layer) for effective downstream training.
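
The pre-training objective described above can be written compactly as follows. The notation is introduced here for illustration: $x$ is the original volume over the voxel grid $\Omega$, $\mathcal{D}$ is the random composition of local masking, downsampling, and Gaussian noise, and $f_{\theta}$ is U-Mamba2 with a reconstruction head. Averaging over voxels is an assumption about the normalization of the L1 loss.
\[
\tilde{x} = \mathcal{D}(x), \qquad
\mathcal{L}_{\mathrm{DAE}} = \frac{1}{|\Omega|} \sum_{v \in \Omega} \big| f_{\theta}(\tilde{x})_{v} - x_{v} \big| .
\]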

\subsection{Domain Knowledge for Dental Anatomy Segmentation}
\label{subsec:domain_know}
\paragraph{Label Smoothing of Related Anatomies.}
Anatomies in the orofacial region are not always distinct and often share similar shapes and properties. For instance, teeth of the same type (incisor, canine, premolar, molar) on the left and right sides, as well as the inferior alveolar and incisive nerves, are structurally closely related. Therefore, to guide the model in recognizing similar classes and their spatial relationships, we replace hard one-hot labels with label smoothing over related anatomies. Specifically, for each voxel with ground truth class label $k$ and a set of related classes $S_{r}$, we initialize a zero vector $p$ as the soft label, then set $p_k=0.9$ and distribute the remaining $0.1$ evenly across the related classes, \ie $p_r = \frac{0.1}{|S_r|}$ for all $r \in S_r$. We apply this strategy to all anatomies with left-right counterparts, to neighboring teeth, and to the inferior alveolar and incisive nerves.
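
As a worked example of this rule, consider a hypothetical voxel of class $k$ with two related classes, $S_r = \{r_1, r_2\}$ (the class names are placeholders, not taken from the actual label set). The soft label is then
\[
p_k = 0.9, \qquad p_{r_1} = p_{r_2} = \frac{0.1}{2} = 0.05, \qquad p_c = 0 \ \ \mathrm{for\ all\ other\ classes}\ c,
\]
so that $\sum_c p_c = 0.9 + 2 \times 0.05 = 1$. Writing $q$ for the Softmax output at that voxel, the cross-entropy with this soft target takes the usual form
\[
\mathcal{L}_{\mathrm{CE}} = - \sum_{c} p_c \log q_c = -0.9 \log q_k - 0.05 \log q_{r_1} - 0.05 \log q_{r_2},
\]
which rewards the model for placing a small amount of probability mass on the related anatomies rather than on unrelated ones.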

\paragraph{Weighted Loss for Tiny Structures.}
ToothFairy3 introduced three additional classes corresponding to the left and right incisive nerves and the lingual foramen, which house thin, sensitive nerves in the mandible. These structures are considerably smaller than other anatomies in the dataset. We account for the volume differences by applying a class weight of $10$ to these three tiny classes, so that their contribution to the overall loss is not overshadowed by larger anatomies.
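
One way to express this weighting, shown here only as a sketch, is through a per-class weight in the cross-entropy term. The notation is illustrative: $\Omega$ is the set of voxels, $k_v$ the ground truth class of voxel $v$, $q_{v,c}$ the predicted probability of class $c$ at $v$, and $\mathcal{T}$ the set of the three tiny classes. Applying the weight to the cross-entropy term and normalizing by $|\Omega|$ are assumptions, since these details are not stated above.
\[
\mathcal{L}_{\mathrm{CE}}^{w} = - \frac{1}{|\Omega|} \sum_{v \in \Omega} w_{k_v} \log q_{v,k_v},
\qquad w_c = 10 \ \ \mathrm{if}\ c \in \mathcal{T}, \quad w_c = 1 \ \ \mathrm{otherwise}.
\]
Under this form, a misclassified voxel of a tiny structure contributes ten times as much to the loss as a voxel of any other class with the same predicted probability.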

\paragraph{Left-Right Mirroring Augmentation.}
The findings of the previous ToothFairy2 challenge \cite{toothfairy2,nnunet_tf2} showed that left-right mirroring augmentation can degrade the model's capability to reliably differentiate the left/right orientation. Even dentists may struggle to reliably identify a horizontally-flipped 2D image without visual cues \cite{xray_leftright}, due to the structural symmetry of left/right anatomies about the sagittal plane. However, we can exploit this anatomical symmetry with careful pre-processing and post-processing, enabling left-right mirroring augmentation without reducing model performance. We propose to swap the class labels of anatomies on opposite sides of the sagittal plane (\eg `Upper left canine' and `Upper right canine') whenever left-right mirroring occurs during data augmentation. We also swap the predicted logits of the corresponding left/right anatomies if the image is mirrored along the left-right axis during test-time augmentation (TTA). With proper processing during training and inference, the number of possible axis combinations for mirroring augmentation is expanded from 3 to 7, which increases augmentation diversity and improves the generalization and performance of U-Mamba2.
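
The count of mirroring combinations follows from enumerating the non-empty subsets of mirrored axes. With all three spatial axes available there are $2^3 - 1 = 7$ such subsets, whereas excluding the left-right axis leaves only $2^2 - 1 = 3$. The label handling can be sketched as follows, with notation introduced here for illustration: $\mathrm{flip}$ denotes mirroring along the left-right axis, $f$ the network mapping an image to per-class logits, and $\pi$ the permutation of class indices that exchanges every left class with its right counterpart and leaves all other classes unchanged (so $\pi = \pi^{-1}$).
\[
\mathrm{training:} \quad (x,\, y) \ \mapsto\ \big(\mathrm{flip}(x),\ \pi \circ \mathrm{flip}(y)\big),
\]
\[
\mathrm{TTA:} \quad \hat{z} = f(\mathrm{flip}(x)), \qquad z_{c} = \mathrm{flip}(\hat{z})_{\pi(c)} \ \ \mathrm{for\ every\ class}\ c .
\]
The re-aligned logits $z$ can then be averaged with the logits of the other mirrored and unmirrored passes, as is usual in TTA.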

\paragraph{Post-processing.}
As a post-processing step, we incorporate the anatomical prior that voxels belonging to the same anatomy should be connected rather than split into separate blobs. Unlike the first-place solution \cite{nnunet_tf2} of the previous ToothFairy2 challenge, we remove small predictions that are likely false positives based on the volume of each connected component \cite{cc3d} instead of the total volume of each class. Specifically, for each class we set the threshold to the 0.5th percentile of the volumes of its connected components in the ground truth. Importantly, this threshold is determined from the statistics of the ground truth rather than from model predictions, ensuring that it is not model-specific. The threshold for each class is pre-computed using the entire ToothFairy3 training dataset.
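
The rule above can be stated compactly as follows. The notation is introduced here for illustration: $\mathcal{V}_c$ is the collection of voxel counts of all ground truth connected components of class $c$ in the training set, $P_{0.5}$ the 0.5th percentile, and $G$ a connected component of class $c$ in a prediction. The strict inequality and the reassignment of removed voxels to the background are assumptions.
\[
\tau_c = P_{0.5}(\mathcal{V}_c), \qquad
G \ \mathrm{is\ removed\ (set\ to\ background)\ if}\ \ |G| < \tau_c .
\]
Because $\tau_c$ is computed per component rather than per class, a small spurious blob is discarded even when the same class also has a large, correct component elsewhere in the volume.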
