\section{Introduction}
Cone-Beam Computed Tomography (CBCT) is a widely used imaging modality in dentistry. It provides comprehensive 3D volumetric information and excellent visualization of the orofacial region, including the jaws, teeth, and nerves \cite{cbct_review}. Accurate segmentation of individual anatomical structures in CBCT images is crucial in applications such as dental diagnosis, treatment, and surgical planning \cite{clinical_cbct_review,cbct_use_tyndall2012,cbct_endo}. However, manual segmentation of CBCT scans requires specialized domain expertise and is extremely time-consuming because the scans are three-dimensional \cite{toothfairy2}. Thus, there is a strong demand for robust and efficient CBCT segmentation algorithms that improve the accuracy and efficiency of dental care and ultimately lead to better patient outcomes.

Generally, network architectures for semantic segmentation fall into three categories: 1) convolutional neural networks (CNNs) such as U-Net \cite{unet, nnunet} and DeepLab \cite{deeplab}, whose translation-invariant convolutions effectively capture hierarchical image features and whose shared kernel weights make them parameter-efficient; 2) Transformers \cite{transformer} such as SETR \cite{setr} and Swin Transformer \cite{liu2021Swin}, which treat an image as a sequence of patches and rely on attention rather than convolution to better capture global information; and 3) hybrid CNN-Transformer architectures such as nnFormer \cite{nnformer} and SwinUNETR \cite{swinunetr}, which attempt to exploit the best of both worlds by combining the two designs.

While hybrid architectures have improved the ability of CNNs to capture global features, their transformer components remain highly resource-intensive because the attention mechanism scales quadratically with input size. This limitation reduces their suitability for healthcare applications, which often involve high-resolution 3D data and constrained computational resources in real-world settings. Recently, structured state space sequence models \cite{s4}, particularly the Mamba model \cite{mamba}, have emerged as an efficient and effective alternative to the transformer. By selectively capturing relevant input features and scaling linearly with input size, Mamba has been shown to outperform transformers across multiple modalities \cite{mamba,s4nd,ssvm}. U-Mamba \cite{umamba} was the first work to leverage Mamba for image segmentation, surpassing transformer-based networks in a range of medical image segmentation tasks. More recently, Dao \etal~\cite{mamba2} proposed Mamba2, based on the structured state-space duality (SSD) framework, which dramatically improves speed without sacrificing performance.
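
To make the scaling argument concrete, consider a sequence of $L$ tokens with feature dimension $d$. As a rough sketch, assuming standard dense self-attention and a state space recurrence with state size $N$ per channel (the symbols $L$, $d$, $N$, $D$, $H$, and $W$ are introduced here for illustration only and are not tied to any specific configuration), the per-layer costs are
\[
  \underbrace{\mathcal{O}(L^{2} d)}_{\mathrm{self\text{-}attention}}
  \qquad \mathrm{versus} \qquad
  \underbrace{\mathcal{O}(L N d)}_{\mathrm{state\ space\ model}} .
\]
For a 3D feature map of size $D \times H \times W$, the sequence length is $L = DHW$, so doubling each spatial dimension multiplies $L$ by $8$, the attention cost by $64$, and the state space cost by only $8$. For instance, a $16 \times 16 \times 16$ feature map yields $L = 4096$ tokens and $L^{2} \approx 1.7 \times 10^{7}$ pairwise attention scores per head.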

In this paper, we propose U-Mamba2, a hybrid CNN-SSD architecture for 3D image segmentation. U-Mamba2 extends the previous U-Mamba model \cite{umamba} by leveraging the Mamba2 SSD framework, which simplifies the Mamba architecture by imposing stronger constraints on the hidden space structure. Mamba2 introduced several architectural changes to enable tensor and sequence parallelism, providing a significant speedup without compromising performance. Similar to U-Mamba, U-Mamba2 can effectively extract local spatial features via CNN and capture global long-range dependencies with Mamba2.
We implement interactive click prompts with cross-attention blocks and incorporate domain knowledge to address key challenges of dental anatomy segmentation in CBCT.
Our extensive experiments demonstrate the superior performance of U-Mamba2 for CBCT segmentation, outperforming previous alternatives and achieving first place
in both Tasks 1 and 2 of the ToothFairy3 challenge.
