Keywords: Heterogeneous Transformer, State-Space Model, Medical Image Segmentation
Abstract: Convolutional Neural Networks (CNNs) have significantly advanced medical image segmentation, offering unparalleled local feature extraction capabilities. However, CNNs face limitations in capturing long-range dependencies due to the local nature of convolutional operations. Recently, State-Space Models (SSMs), such as Mamba, have presented an efficient solution by incorporating gating, convolutions, and data-dependent filtering mechanisms for long-range interaction modeling. However, as attention-free mechanisms, SSMs are less efficient than attention at handling variable-distance token-to-token interactions. In this paper, we introduce Hetero-UNet, a novel hybrid U-Net architecture that incorporates both SSMs and attention mechanisms to model long-range dependencies. Featuring a hybrid Transformer-Mamba encoder within the original U-Net architecture, it excels at extracting both local and global features. Our extensive experiments across diverse tasks (abdominal organ segmentation in CT and MR, instrument segmentation in endoscopy, and cell segmentation in microscopy) demonstrate Hetero-UNet's superior performance over previous state-of-the-art segmentation models, paving the way for hybrid long-range dependency modeling in medical imaging. The code is available at https://github.com/ZhilingYan/Hetero-UNet.
Submission Number: 68