Global and Local Vision-Language Alignment for Few-Shot Learning and Few-Shot OOD Detection

Published: 2025 · Last Modified: 12 Nov 2025 · MICCAI (5) 2025 · CC BY-SA 4.0
Abstract: Training data in the medical domain is often limited due to privacy concerns and data scarcity. In such few-shot settings, neural network models are prone to overfitting, resulting in poor performance on new in-distribution (ID) data and misclassification of out-of-distribution (OOD) data as learned ID diseases. Existing research treats these two tasks (few-shot learning and few-shot OOD detection) separately, and no prior work has explored a unified approach that simultaneously improves the performance of both. To bridge this gap, we propose a novel CLIP-based framework that jointly enhances ID classification accuracy and OOD detection performance. Our framework consists of three key components: (1) a visually-guided text refinement module, which refines the text representation of each disease using disease-relevant visual information; (2) a local version of supervised contrastive learning, which enhances local representation consistency among disease-relevant regions while improving ID-OOD separability; and (3) a global and local image-text alignment strategy, which adaptively combines global and local similarity measurements for better image-text alignment. Extensive experiments demonstrate that our method outperforms the best methods tailored specifically to each task, achieving new state-of-the-art performance. The source code is available at https://openi.pcl.ac.cn/OpenMedIA/GLAli.
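The global and local alignment strategy described in component (3) can be illustrated with a minimal PyTorch sketch. This is not the paper's implementation: the function name, tensor shapes, the use of a max over patches for local similarity, and the fixed fusion weight `alpha` (which the paper combines adaptively) are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def global_local_alignment(global_img, local_img, text, alpha=0.5):
    """Sketch of fusing global and local image-text similarities.

    global_img: (B, D)    global image embeddings (e.g. CLIP [CLS] features)
    local_img:  (B, P, D) patch-level image embeddings
    text:       (C, D)    per-class text embeddings
    alpha:      fusion weight; a fixed scalar here stands in for the
                adaptive combination described in the paper.
    """
    # Normalize so dot products are cosine similarities
    global_img = F.normalize(global_img, dim=-1)
    local_img = F.normalize(local_img, dim=-1)
    text = F.normalize(text, dim=-1)

    # Global similarity: image-text cosine -> (B, C)
    sim_global = global_img @ text.t()

    # Local similarity: best-matching patch per class -> (B, C)
    sim_local = torch.einsum("bpd,cd->bpc", local_img, text).max(dim=1).values

    # Convex combination of the two similarity measurements
    return alpha * sim_global + (1 - alpha) * sim_local
```

The fused score can then be used both for ID classification (argmax over classes) and as an OOD score (e.g. via its maximum over classes), which is what lets a single alignment serve both tasks.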