A Balancing Act: Optimizing Classification and Retrieval in Cross-Modal Vision Models

Published: 27 Mar 2025, Last Modified: 01 May 2025
MIDL 2025 Poster
License: CC BY 4.0
Keywords: Vision-Language Models, Cross-Modal Retrieval, Representation Learning, Prostate Cancer Grading, Digital Pathology
TL;DR: A straightforward and effective approach for balancing classification accuracy and vision-text alignment within cross-modal vision models.
Abstract: Despite the promising capabilities of vision-language models (VLMs) across diverse tasks, recent studies reveal that they struggle with the fundamental task of image classification. In this study, we explore leveraging state-of-the-art task-specific classification models as a foundation for VLMs, aiming to preserve strong classification performance. Specifically, we assess the impact of contrastive tuning, applied to enable cross-modal retrieval, on two models: a Hierarchical Image Pyramid Transformer (HIPT) trained for prostate cancer grading in Whole-Slide Images (WSIs), and a ViT-Base model trained for multi-label classification on natural images. Our results demonstrate that contrastive fine-tuning creates a clear trade-off: classification accuracy rapidly deteriorates toward zero as vision-text alignment improves. By balancing the two objectives in the loss function during fine-tuning, we achieve competitive slide-level retrieval performance while maintaining classification accuracy.
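The abstract describes balancing a classification objective against a vision-text alignment objective in a single loss. The sketch below illustrates one plausible form of such a combined loss: a weighted sum of cross-entropy classification and a symmetric CLIP-style InfoNCE alignment term. The weighting parameter `alpha`, the temperature value, and the function name are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def combined_loss(image_emb, text_emb, class_logits, labels,
                  alpha=0.5, temperature=0.07):
    """Weighted sum of a classification loss and a CLIP-style contrastive
    (InfoNCE) alignment loss. alpha=1.0 recovers pure classification,
    alpha=0.0 pure alignment; intermediate values trade off the two.
    Illustrative sketch only; the paper's formulation may differ."""
    # Classification objective: standard cross-entropy on the vision head.
    cls_loss = F.cross_entropy(class_logits, labels)

    # Alignment objective: symmetric InfoNCE over normalized embeddings.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    contrastive_loss = (F.cross_entropy(logits, targets) +
                        F.cross_entropy(logits.t(), targets)) / 2

    # The mixing weight controls the trade-off described in the abstract.
    return alpha * cls_loss + (1 - alpha) * contrastive_loss
```

Sweeping `alpha` between 0 and 1 would trace out the trade-off curve the abstract refers to: classification accuracy is preserved at high `alpha`, while retrieval-oriented alignment improves as `alpha` decreases.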
Primary Subject Area: Transfer Learning and Domain Adaptation
Secondary Subject Area: Unsupervised Learning and Representation Learning
Paper Type: Methodological Development
Registration Requirement: Yes
Reproducibility: https://github.com/DIAGNijmegen/tradeoff_classification_alignment.git
Submission Number: 202