Keywords: Zero-Shot, Fine-Tuning, Pre-Training, Multi-Modal, Self-Supervised Learning
TL;DR: Domain pre-trained text models can improve CLIP-like models' performance on low-prevalence diseases, at the cost of breaking image-text alignment and degrading common-pathology detection, motivating future ensemble models.
Abstract: Recent advances in zero-shot learning have enabled paired image-text data to replace structured labels, removing the need for expert-annotated datasets. Models such as the CLIP-based CheXzero apply these advances to chest X-ray interpretation. We hypothesize that domain pre-trained models such as CXR-BERT, BlueBERT, and ClinicalBERT can improve the performance of CLIP-like models by contributing specific domain knowledge through replaced BERT weights, at the cost of breaking the original model's image-text alignment. We evaluate zero-shot classification models with domain-specific pre-training for detecting low-prevalence pathologies. Although replacing the weights of the original CLIP text encoder degrades performance on commonly found pathologies, we show that domain pre-trained text towers perform markedly better on low-prevalence diseases. This motivates future ensemble models that combine differently trained language models for maximal performance.
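The setup the abstract describes can be sketched in miniature: a CLIP-like model performs zero-shot classification by comparing an image embedding against prompt embeddings produced by a text tower, so swapping in a domain pre-trained text encoder changes only how prompts are embedded while the similarity rule stays fixed. The following toy sketch (NumPy, with random vectors standing in for real encoder outputs; all names here are illustrative, not the authors' code) shows that interface:

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit sphere, as CLIP does before matching."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def zero_shot_classify(image_emb, prompt_embs):
    """Return the index of the prompt most cosine-similar to the image."""
    sims = l2_normalize(prompt_embs) @ l2_normalize(image_emb)
    return int(np.argmax(sims))

rng = np.random.default_rng(0)
dim = 128

# Stand-in for the (frozen) image tower's output for one chest X-ray.
image_emb = rng.normal(size=dim)

# Stand-in for a text tower embedding two prompts, e.g. "pneumonia" vs.
# "no finding". Prompt 0 is constructed to correlate with the image.
prompt_embs = rng.normal(size=(2, dim))
prompt_embs[0] += image_emb  # aligned prompt

pred = zero_shot_classify(image_emb, prompt_embs)
print(pred)  # the aligned prompt wins: 0
```

Replacing the text tower (e.g. with CXR-BERT) amounts to regenerating `prompt_embs` from a different encoder; if that encoder's space is not aligned with the image tower's, the cosine ranking can break for common classes even while domain knowledge sharpens rare-class prompts, which is the trade-off the abstract reports.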