Keywords: Vision-Language Models, Localized Classification, Zero-Shot Classification
TL;DR: We improve zero-shot chest X-ray classification by adding visual markers (arrows, boxes, circles) to guide vision-language models' attention.
Abstract: Medical image classification plays a crucial role in clinical decision-making, yet most models are constrained to a fixed set of predefined classes, limiting their adaptability to new conditions. Contrastive Language-Image Pretraining (CLIP) offers a promising solution by enabling zero-shot classification through multimodal large-scale pretraining. However, while CLIP effectively captures global image content, radiology requires a more localized focus on specific pathology regions to enhance both interpretability and diagnostic accuracy. To address this, we explore the potential of incorporating visual cues into zero-shot classification, embedding visual markers (such as arrows, bounding boxes, and circles) directly into radiological images to guide model attention. Evaluating across four public chest X-ray datasets, we demonstrate that visual markers improve AUROC by up to 0.185, highlighting their effectiveness in enhancing classification performance. Furthermore, attention map analysis confirms that visual cues help models focus on clinically relevant areas, leading to more interpretable predictions. To support further research, we use public datasets and will release our code and preprocessing pipeline, providing a reference point for future work on localized classification in medical imaging.
Primary Subject Area: Application: Radiology
Secondary Subject Area: Detection and Diagnosis
Paper Type: Validation or Application
Registration Requirement: Yes
Reproducibility: Will be provided
Visa & Travel: Yes
Submission Number: 80