Patch-MoE Mamba: A Patch-Ordered Mixture-of-Experts State Space Architecture for Medical Image Segmentation
Keywords: Vision Mamba, State Space Models, Mixture-of-Experts, Medical Image Segmentation
TL;DR: A patch-ordered mixture-of-experts state space (Mamba) architecture that combines multi-scale patch scanning with adaptive fusion of directional scan outputs for medical image segmentation.
Abstract: Advanced convolutional neural networks (CNNs) and Transformer-based architectures currently achieve state-of-the-art performance in medical image segmentation. However, CNNs have limited capacity to model long-range dependencies, while Transformers incur at least quadratic computational and memory complexity in the number of tokens, which can hinder their deployment in resource-constrained clinical settings and make model training and tuning more demanding. Recently, state space models (SSMs), such as Vision Mamba, have gained attention for their ability to capture global dependencies with linear complexity in sequence length. Despite promising results, existing Mamba-based segmentation networks (e.g., VM-UNetV2) still face two key challenges for medical image segmentation: (1) pixel-wise scanning along fixed directions does not sufficiently preserve or exploit local 2D spatial structure, and (2) feature fusion across scan directions typically relies on simple summation, which fails to adapt to varying object sizes and shapes, leading to inaccurate boundary localization and incomplete object masks.
To address these limitations, we propose \textbf{Patch-MoE Mamba}, a patch-ordered mixture-of-experts (MoE) state space architecture for medical image segmentation. First, we develop a hierarchical, patch-ordered scanning mechanism that partitions feature maps into local patches and applies directional scanning with different patch sizes at different stages, thereby better preserving spatial neighborhoods and capturing multi-scale spatial context. Second, we introduce an MoE-based fusion module that adaptively combines the outputs of multiple directional Mamba scanners. This module integrates four directional scanners with a learnable concatenation expert and incorporates a residual summation of all directional outputs, which stabilizes expert-weight computation and yields more discriminative fused representations. Extensive experiments on five public polyp segmentation benchmarks and the ISIC 2017 and 2018 skin lesion segmentation datasets demonstrate the effectiveness and generality of the proposed Patch-MoE Mamba.
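For concreteness, the following is a minimal, hypothetical PyTorch sketch of the MoE-style fusion of four directional scan outputs described above. The module name, the global-pooled gating computed over the residual sum of directional outputs, and the 1x1-convolution concatenation expert are illustrative assumptions for exposition, not the paper's exact implementation.

import torch
import torch.nn as nn

class DirectionalMoEFusion(nn.Module):
    """Hypothetical sketch: MoE-style fusion of four directional scan outputs.

    Gating weights are computed from the residual sum of all directional
    outputs and used to mix four per-direction experts plus a learnable
    concatenation expert; the residual sum is added back at the end.
    """

    def __init__(self, channels: int, num_directions: int = 4):
        super().__init__()
        self.num_experts = num_directions + 1  # 4 directions + concat expert
        # Concatenation expert: project concatenated directional features
        # back to the original channel width with a 1x1 convolution.
        self.concat_expert = nn.Conv2d(num_directions * channels, channels, kernel_size=1)
        # Gating network: global average pooling + linear layer + softmax
        # applied to the residual sum of the directional outputs.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(channels, self.num_experts),
            nn.Softmax(dim=-1),
        )

    def forward(self, dir_feats):
        # dir_feats: list of 4 tensors, each (B, C, H, W), one per scan direction.
        residual_sum = torch.stack(dir_feats, dim=0).sum(dim=0)        # (B, C, H, W)
        concat_out = self.concat_expert(torch.cat(dir_feats, dim=1))   # (B, C, H, W)
        experts = dir_feats + [concat_out]                             # 5 expert outputs
        weights = self.gate(residual_sum)                              # (B, num_experts)
        weights = weights.view(weights.size(0), self.num_experts, 1, 1, 1)
        stacked = torch.stack(experts, dim=1)                          # (B, num_experts, C, H, W)
        fused = (weights * stacked).sum(dim=1)                         # adaptive mixture
        return fused + residual_sum                                    # residual summation

# Usage sketch with four dummy directional feature maps.
if __name__ == "__main__":
    feats = [torch.randn(2, 64, 32, 32) for _ in range(4)]
    fusion = DirectionalMoEFusion(channels=64)
    out = fusion(feats)
    print(out.shape)  # torch.Size([2, 64, 32, 32])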
Primary Subject Area: Segmentation
Secondary Subject Area: Application: Endoscopy
Registration Requirement: Yes
Visa & Travel: Yes
Read CFP & Author Instructions: Yes
Originality Policy: Yes
Single-blind & Not Under Review Elsewhere: Yes
LLM Policy: Yes
Submission Number: 60