Keywords: Biomedical Vision-Language Models, Medical Image Classification
Abstract: Effective adaptation of Vision–Language Models (VLMs) to biomedical tasks remains challenging due to the substantial semantic gap between general knowledge and domain-specific expertise. Domain-specific models such as BiomedCLIP narrow this gap; however, prevailing prompt-learning methods collapse diverse text embeddings into a single prototype, discarding distributional information. We introduce vMF Distribution Semantic Alignment (VDSA), which models each class with a von Mises–Fisher distribution on the unit hypersphere and aligns images to the entire distribution rather than a single prototype. We further derive a closed-form upper bound on the expected contrastive loss, yielding a sampling-free objective that is implicitly equivalent to aligning against an infinite prompt ensemble at minimal overhead. Experiments on multiple biomedical benchmarks show that VDSA consistently improves few-shot adaptation and generalization to unseen classes, providing a robust recipe for adapting specialized VLMs.
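The abstract's two ingredients can be sketched concretely: fitting a per-class von Mises–Fisher (vMF) distribution to unit-norm prompt embeddings, and evaluating a closed-form expectation of the exponentiated image–text similarity via the vMF moment generating function (a log-sum-exp over such terms upper-bounds the expected contrastive loss by Jensen's inequality). This is an illustrative sketch under our own assumptions, not the paper's actual derivation; all function names and the moment-based concentration estimate (Banerjee et al.'s approximation) are choices made here for illustration.

```python
import numpy as np
from scipy.special import ive  # exponentially scaled modified Bessel function I_nu(x)*exp(-x)

def vmf_log_norm(kappa, d):
    """Log normalizing constant log C_d(kappa) of a vMF density on the (d-1)-sphere:
    C_d(k) = k^(d/2-1) / ((2*pi)^(d/2) * I_{d/2-1}(k))."""
    nu = d / 2.0 - 1.0
    # log I_nu(k) = log ive(nu, k) + k  (undo the exponential scaling)
    return nu * np.log(kappa) - (d / 2.0) * np.log(2 * np.pi) - (np.log(ive(nu, kappa)) + kappa)

def fit_vmf(embeddings):
    """Fit mean direction mu and concentration kappa from unit-norm text embeddings,
    using the standard moment-based approximation for kappa (an assumption here)."""
    n, d = embeddings.shape
    resultant = embeddings.sum(axis=0)
    r_norm = np.linalg.norm(resultant)
    mu = resultant / r_norm
    r_bar = r_norm / n
    kappa = r_bar * (d - r_bar ** 2) / (1.0 - r_bar ** 2)
    return mu, kappa

def log_expected_exp_similarity(image_emb, mu, kappa, tau=0.07):
    """Closed-form log E_{t ~ vMF(mu, kappa)}[exp(image^T t / tau)] via the vMF
    moment generating function: log C_d(kappa) - log C_d(||kappa*mu + image/tau||).
    No sampling over prompts is needed."""
    d = mu.shape[0]
    shifted = np.linalg.norm(kappa * mu + image_emb / tau)
    return vmf_log_norm(kappa, d) - vmf_log_norm(shifted, d)
```

In this sketch, `fit_vmf` would be applied per class to the embeddings of diverse prompts, and `log_expected_exp_similarity` replaces the single-prototype logit inside a contrastive objective, which is what makes the objective sampling-free.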
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 3067