The Cost of Consistency: Why Cross-Plane Contrastive Learning Fails to Bridge the Gap Between MedSAM-3 and nnU-Net
Keywords: Vision Foundation Models, 3D Medical Segmentation, Negative Results, Cross-Plane Consistency, Computational Efficiency
Abstract: While Vision Foundation Models (VFMs) like SAM-3 and their agentic variants (e.g., MedSAM-3 with Gemini) excel in 2D tasks, we demonstrate that they significantly underperform traditional nnU-Nets in 3D volumetric medical segmentation. Because VFMs lack native 3D spatial consistency, they require complex post-processing or architectural adaptations. In this work, we attempt to bridge this gap using a Cross-Plane Contrastive Loss framework to enforce volumetric coherence. We report a negative result: the requirement to process three orthogonal views simultaneously introduces a computational bottleneck that makes iterative fine-tuning infeasible in resource-constrained environments. We conclude that despite the semantic capabilities of Large Multimodal Models, lightweight, consistency-aware 3D architectures remain the efficient "gold standard" for volumetric precision.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 66