Exploring Foundation Model Adaptations for 3D Medical Imaging: Prompt-Based Segmentation with xLSTM network

05 Jun 2025 (modified: 09 Jun 2025) — CVPR 2025 Workshop MedSegFM Submission — CC BY 4.0
Keywords: 3D medical image segmentation, interactive segmentation, vision transformers, xLSTM-UNet, SAM-Med3D, foundation models, user-guided refinement, volumetric data, prompt-based segmentation
Abstract: Accurate segmentation of anatomical and pathological structures in 3D medical imaging is critical for effective diagnosis, treatment planning, and disease monitoring. Despite recent advances in deep learning, automated 3D medical image segmentation remains challenging due to anatomical variability, imaging artifacts, and the limited availability of annotated data. To address these issues, we present an interactive segmentation framework built on the SAM-Med3D architecture with an xLSTM-UNet image encoder. The encoder is specifically designed to capture long-range dependencies and hierarchical spatial features in volumetric medical data, improving contextual awareness while maintaining computational efficiency. We validate our approach using the CoreSet from the CVPR 2025 Foundation Models for 3D Biomedical Image Segmentation Challenge. Initial results demonstrate that our model achieves competitive performance in limited-scale testing, with DSC Final scores of 0.4855 (CT), 0.3071 (MRI), 0.4070 (PET), and 0.4458 (Ultrasound). NSD Final scores follow a similar trend, reaching 0.4992 (Ultrasound) and 0.4545 (CT). These early findings suggest strong potential for our architecture, particularly with further training on the full dataset. The proposed model supports multimodal prompts, including points and bounding boxes, allowing for flexible and intuitive user interaction, a key requirement in clinical workflows. Our contributions include the development of a 3D-optimized interactive segmentation encoder, its integration into an existing foundation model framework, and an empirical evaluation that highlights the feasibility of our design. Future work will focus on full-scale training and refinement to bridge the performance gap with state-of-the-art methods.
Submission Number: 6