Keywords: Medical Imaging, Multimodal, 3D Interactive Segmentation
TL;DR: ENSAM is a fast, lightweight 3D medical image segmentation model, validated on multiple imaging types.
Abstract: We present ENSAM (Equivariant, Normalized, Segment Anything Model), a lightweight and promptable model for universal medical image segmentation in 3D. Designed for interactive use and constrained computational settings, ENSAM is trained from scratch on less than 5,000 images using a single GPU. The model integrates a SegResNet-based encoder with a prompt encoder and mask decoder in a U-Net-style configuration, featuring cross-attention at the latent level. To improve training speed and efficiency, we incorporate relative positional encoding, normalized attention, and the Muon optimizer. Evaluated on a diverse validation set spanning CT, MRI, PET, ultrasound, and microscopy, ENSAM achieves competitive performance with an average AUC Dice score of 1.948 across five simulated user interactions while requiring significantly fewer computational resources than existing foundation models. Ablation studies confirm the benefits of our architectural and optimization choices, suggesting ENSAM as a scalable and efficient foundation for future medical image segmentation research.
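As a rough illustration of the layout the abstract describes (an image encoder, a prompt encoder, a mask decoder, and cross-attention at the latent level), the PyTorch sketch below wires stand-in modules together. The dimensions, module choices, and point-prompt format are assumptions for illustration only; ENSAM's actual implementation uses a SegResNet-based encoder, relative positional encoding, and normalized attention, none of which are reproduced here.

```python
# Minimal sketch of a promptable U-Net-style 3D segmenter with cross-attention
# at the latent (bottleneck) level. All names and hyperparameters are assumed.
import torch
import torch.nn as nn


class LatentCrossAttention(nn.Module):
    """Cross-attend latent image tokens (queries) to prompt tokens (keys/values)."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, latent_tokens, prompt_tokens):
        out, _ = self.attn(latent_tokens, prompt_tokens, prompt_tokens)
        return self.norm(latent_tokens + out)


class ToyPromptableSegmenter(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        # Stand-in convolutional encoder; ENSAM uses a SegResNet-based encoder.
        self.encoder = nn.Sequential(
            nn.Conv3d(1, dim, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv3d(dim, dim, 3, stride=2, padding=1), nn.GELU(),
        )
        # Hypothetical prompt encoder: 3D click coordinates -> prompt tokens.
        self.prompt_encoder = nn.Linear(3, dim)
        self.cross_attn = LatentCrossAttention(dim)
        # Stand-in decoder producing single-channel mask logits.
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(dim, dim, 2, stride=2), nn.GELU(),
            nn.ConvTranspose3d(dim, 1, 2, stride=2),
        )

    def forward(self, image, clicks):
        # image: (B, 1, D, H, W); clicks: (B, N, 3) normalized point prompts
        feats = self.encoder(image)                # (B, C, d, h, w)
        B, C, d, h, w = feats.shape
        tokens = feats.flatten(2).transpose(1, 2)  # (B, d*h*w, C)
        prompts = self.prompt_encoder(clicks)      # (B, N, C)
        tokens = self.cross_attn(tokens, prompts)  # condition latent on prompts
        feats = tokens.transpose(1, 2).reshape(B, C, d, h, w)
        return self.decoder(feats)                 # (B, 1, D, H, W) mask logits


# Example usage with a random volume and four simulated click prompts:
# logits = ToyPromptableSegmenter()(torch.randn(1, 1, 32, 32, 32),
#                                   torch.rand(1, 4, 3))
```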
Submission Number: 8