ENSAM: an efficient foundation model for interactive segmentation of 3D medical images

05 Jun 2025 (modified: 09 Jun 2025) · CVPR 2025 Workshop MedSegFM Submission · CC BY 4.0
Keywords: Medical Imaging, Multimodal, 3D Interactive Segmentation
TL;DR: ENSAM is a fast, lightweight 3D medical image segmentation model, validated on multiple imaging types.
Abstract: We present ENSAM (Equivariant, Normalized, Segment Anything Model), a lightweight and promptable model for universal medical image segmentation in 3D. Designed for interactive use and constrained computational settings, ENSAM is trained from scratch on fewer than 5,000 images using a single GPU. The model integrates a SegResNet-based encoder with a prompt encoder and mask decoder in a U-Net-style configuration, featuring cross-attention at the latent level. To improve training speed and efficiency, we incorporate relative positional encoding, normalized attention, and the Muon optimizer. Evaluated on a diverse validation set spanning CT, MRI, PET, ultrasound, and microscopy, ENSAM achieves competitive performance with an average AUC Dice score of 1.948 across five simulated user interactions while requiring significantly fewer computational resources than existing foundation models. Ablation studies confirm the benefits of our architectural and optimization choices, suggesting ENSAM as a scalable and efficient foundation for future medical image segmentation research.
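The abstract names cross-attention at the latent level and normalized attention but does not spell out the formulation. The PyTorch sketch below illustrates one plausible reading: cosine (QK-normalized) cross-attention between flattened 3D encoder latents and prompt tokens, with a learned per-head temperature in place of the usual 1/sqrt(d) scaling. The class name, projection layout, and temperature parameterization are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NormalizedLatentCrossAttention(nn.Module):
    """Cross-attention from encoder latents to prompt tokens.

    'Normalized attention' is interpreted here as L2-normalizing queries
    and keys before the dot product (cosine attention) with a learned
    per-head temperature; this is an assumption, as the abstract does
    not define the exact formulation.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.q_proj = nn.Linear(dim, dim)
        self.kv_proj = nn.Linear(dim, 2 * dim)
        self.out_proj = nn.Linear(dim, dim)
        # Learned per-head temperature replaces the 1/sqrt(d) scale.
        self.logit_scale = nn.Parameter(torch.zeros(num_heads))

    def forward(self, latents: torch.Tensor, prompts: torch.Tensor) -> torch.Tensor:
        # latents: (B, N, C) flattened 3D feature map; prompts: (B, M, C)
        B, N, C = latents.shape
        q = self.q_proj(latents).view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        k, v = self.kv_proj(prompts).chunk(2, dim=-1)
        k = k.view(B, -1, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, -1, self.num_heads, self.head_dim).transpose(1, 2)
        # Cosine attention: unit-norm q/k, learned temperature per head.
        q = F.normalize(q, dim=-1)
        k = F.normalize(k, dim=-1)
        scale = self.logit_scale.exp().view(1, -1, 1, 1)
        attn = ((q @ k.transpose(-2, -1)) * scale).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return latents + self.out_proj(out)  # residual connection
```

In an interactive-segmentation setting of this kind, the prompt tokens would come from a prompt encoder (e.g., embeddings of user clicks or boxes), and the block would sit at the U-Net bottleneck so that each refinement round conditions the latent volume on the accumulated prompts.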
Submission Number: 8