Keywords: Robustness, Medical Image Analysis, Human-AI alignment, Knowledge Injection, Vision Language Model.
Abstract: Vision–language models (VLMs) show promise for clinical decision support in radiology because they enable joint reasoning over radiological images and clinical text, thereby leveraging complementary clinical information. However, radiological findings are long-tailed in practice, leaving some conditions underrepresented and making zero-shot inference essential. Yet current CLIP-style medical VLMs are sensitive to prompt variations and often lack trustworthy external knowledge at inference time, which hinders reliable clinical deployment. We present KEPIL, a prompt-robust framework that integrates curated medical knowledge to stabilize zero-shot generalization. KEPIL comprises: (i) dynamic prompt enrichment using ontologies with LLM assistance, (ii) a semantic-aware contrastive loss aligning embeddings of equivalent prompt variants via a dual-embedding objective, and (iii) entity-centric report standardization to yield ontology-aligned representations. Across seven benchmarks, KEPIL achieves state-of-the-art zero-shot and fine-tuning performance in classification and segmentation; under prompt-variation tests, it improves AUC by 6.37% on CheXpert and by 4.11% on average. Ablations and qualitative analyses validate the contributions of enriched prompts and semantic alignment, while attention maps highlight clinically relevant regions. These results show that structured knowledge and robust prompt design are key to clinically reliable radiology-facing VLMs. Code will be released at ***.
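As a rough illustration of component (ii), a contrastive objective that pulls embeddings of equivalent prompt variants toward a shared anchor while pushing away prompts for other findings can be sketched as below. This is a minimal InfoNCE-style sketch under assumed shapes and names (`prompt_variant_alignment_loss`, `tau`); the paper's exact dual-embedding objective is not specified in the abstract.

```python
import numpy as np

def prompt_variant_alignment_loss(anchor, variants, negatives, tau=0.07):
    """Sketch of a semantic-aware contrastive loss (illustrative only).

    anchor:    (d,)   embedding of a reference prompt for a finding
    variants:  (P, d) embeddings of semantically equivalent prompt paraphrases
    negatives: (N, d) embeddings of prompts describing other findings
    tau:       temperature scaling the cosine similarities
    """
    def l2norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    a = l2norm(np.asarray(anchor, dtype=float))
    pos = l2norm(np.asarray(variants, dtype=float))
    neg = l2norm(np.asarray(negatives, dtype=float))

    pos_sim = pos @ a / tau  # similarity to each equivalent variant, shape (P,)
    neg_sim = neg @ a / tau  # similarity to each negative prompt, shape (N,)

    # InfoNCE-style: each positive competes against all negatives;
    # the loss is low when variants sit near the anchor and negatives do not.
    denom = np.exp(pos_sim) + np.exp(neg_sim).sum()
    return float(-np.mean(np.log(np.exp(pos_sim) / denom)))
```

The loss decreases as paraphrased prompts cluster around the anchor, which is one way to operationalize robustness to prompt variation.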
Supplementary Material: zip
Primary Area: transfer learning, meta learning, and lifelong learning
Submission Number: 6014