Keywords: 3D medical image pretraining, contrastive learning
TL;DR: We introduce SegVL, a unified contrastive learning framework that integrates segmentation masks into vision-language pretraining.
Abstract: Pretraining effective 3D medical image encoders is crucial for downstream tasks such as diagnosis and prognosis. Existing vision-language methods learn global semantics from paired radiology reports but often miss fine-grained cues such as small lesions. We introduce SegVL, a unified contrastive learning framework that integrates segmentation masks into vision-language pretraining. SegVL aligns voxel-level features with segmentation labels using mask names as textual anchors and enhances image-text contrast via segmentation-informed features. A Tversky loss addresses class imbalance, and a lightweight decoder preserves encoder capacity. Experiments show that SegVL outperforms prior methods on multiple classification and segmentation benchmarks, highlighting the complementary strengths of segmentation and language supervision.
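The abstract names a Tversky loss as the mechanism for handling class imbalance. As a rough illustration of that component only, below is a minimal, self-contained PyTorch sketch of a generic soft binary Tversky loss; the function name, tensor shapes, and the alpha = 0.3 / beta = 0.7 defaults are illustrative assumptions and are not taken from the paper.

```python
import torch

def tversky_loss(probs: torch.Tensor, target: torch.Tensor,
                 alpha: float = 0.3, beta: float = 0.7,
                 eps: float = 1e-6) -> torch.Tensor:
    """Soft binary Tversky loss for voxel-wise segmentation (illustrative sketch).

    probs:  predicted foreground probabilities, shape (B, D, H, W)
    target: binary ground-truth mask, same shape
    alpha weights false positives, beta weights false negatives;
    a commonly used setting is alpha=0.3, beta=0.7.
    """
    dims = tuple(range(1, probs.ndim))           # reduce over all spatial dims
    tp = (probs * target).sum(dims)              # soft true positives
    fp = (probs * (1 - target)).sum(dims)        # soft false positives
    fn = ((1 - probs) * target).sum(dims)        # soft false negatives
    tversky = (tp + eps) / (tp + alpha * fp + beta * fn + eps)
    return (1 - tversky).mean()

# Hypothetical usage with random inputs and a sparse foreground mask:
probs = torch.sigmoid(torch.randn(2, 32, 64, 64))    # fake logits -> probabilities
target = (torch.rand(2, 32, 64, 64) > 0.99).float()  # rare "lesion" voxels
loss = tversky_loss(probs, target)
```

With beta > alpha, false negatives are penalized more heavily than false positives, which raises sensitivity to rare foreground structures such as the small lesions the abstract highlights.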
Submission Number: 34