Language-guided Vision Model for Pan-cancer Segmentation in Abdominal CT Scans

Published: 31 Mar 2025, Last Modified: 31 Mar 2025, FLARE 2024, with Minor Revisions, CC BY 4.0
Keywords: CLIP, VLM, CT
Abstract: Accurate segmentation of abdominal tumors is critical for clinical diagnosis, disease research, and treatment planning. Because deep learning-based segmentation techniques typically require large amounts of labeled data for training, developing precise segmentation methods that rely on smaller labeled datasets is crucial in medical image analysis. Recently, pre-trained vision-language foundation models such as CLIP have opened new possibilities for general computer vision tasks. Leveraging the generalization capabilities of these pre-trained models in downstream tasks such as segmentation can achieve remarkable performance with relatively limited labeled data. However, these models remain underexplored for tumor segmentation. In this paper, we therefore propose a novel framework called the Language-guided Vision Model. Our approach employs pre-trained CLIP as a powerful feature extractor to generate segmentations of 3D CT scans while adaptively aggregating cross-modal representations of text and images. On the validation leaderboard of the FLARE 2024 challenge, our method achieves a mean DSC of 43% and a mean NSD of 38% for tumor segmentation.
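The abstract does not specify how the cross-modal text and image representations are aggregated; a common pattern in CLIP-guided segmentation is to let text embeddings of class prompts act as dynamic per-class classifiers over voxel features. The sketch below illustrates that pattern only; the module names, dimensions, and fusion scheme are illustrative assumptions, not the authors' implementation, and the random text embeddings stand in for outputs of a frozen CLIP text encoder.

```python
# Minimal sketch (not the authors' code) of text-guided 3D segmentation:
# CLIP text embeddings of class prompts modulate voxel features to produce
# one mask logit map per class.
import torch
import torch.nn as nn


class TextGuidedSegHead(nn.Module):
    def __init__(self, feat_dim: int = 64, text_dim: int = 512):
        super().__init__()
        # Project CLIP text embeddings into the image-feature space.
        self.text_proj = nn.Linear(text_dim, feat_dim)
        # Lightweight 1x1x1 transform of the 3D image features.
        self.mask_head = nn.Conv3d(feat_dim, feat_dim, kernel_size=1)

    def forward(self, voxel_feats, text_embeds):
        # voxel_feats: (B, C, D, H, W) features from a 3D image encoder
        # text_embeds: (K, text_dim) CLIP embeddings of K class prompts
        q = self.text_proj(text_embeds)               # (K, C)
        v = self.mask_head(voxel_feats)               # (B, C, D, H, W)
        # Dot product of each class embedding with every voxel feature.
        logits = torch.einsum("kc,bcdhw->bkdhw", q, v)
        return logits                                  # (B, K, D, H, W)


if __name__ == "__main__":
    B, C, D, H, W, K = 1, 64, 16, 32, 32, 2
    feats = torch.randn(B, C, D, H, W)  # stand-in for 3D CNN encoder features
    # Placeholder for CLIP text embeddings of prompts such as
    # "a CT scan showing a liver tumor"; in practice these would come from
    # the frozen, pre-trained CLIP text encoder.
    text = torch.randn(K, 512)
    head = TextGuidedSegHead()
    print(head(feats, text).shape)  # torch.Size([1, 2, 16, 32, 32])
```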
Submission Number: 14