Keywords: Medical anomaly detection, vision–language models, CLIP
TL;DR: MCMIAD is a unified, lightweight CLIP-based framework that enables efficient, prompt-guided, and modality-agnostic medical anomaly detection.
Abstract: Accurate anomaly detection in medical imaging is critical for clinical decision-making, yet many deployed systems still rely on disease-specific models and large labeled datasets.
We present \textbf{MCMIAD}, a unified vision--language framework that couples a frozen EfficientNet image encoder and a CLIP text encoder with a shallow cross-modal fusion block and a denoising Transformer decoder.
The framework is designed around three goals: \emph{modality-agnostic deployment}, \emph{prompt-guided explainability}, and \emph{practical efficiency}.
MCMIAD keeps the vision backbone frozen and trains only a compact reconstruction head, making the method lightweight enough for typical clinical GPUs.
On the BMAD benchmark, MCMIAD achieves strong image- and pixel-level AUROC across retina OCT, brain tumor MRI, and liver tumor CT, with particularly notable gains in one-shot settings where only a single normal example per category is available.
Its anomaly heatmaps align with expected clinical regions of interest, supporting human-in-the-loop review.
We further analyze the contributions of CLIP-guided cross-attention and model size, and we discuss robustness, fairness, and deployment considerations relevant to real-world clinical workflows.
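As a rough illustration of the shallow cross-modal fusion described above (this is a minimal sketch, not the authors' implementation; the module name, feature dimensions, and prompt setup are illustrative assumptions), a CLIP-guided cross-attention block in PyTorch might look like:

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Hypothetical shallow fusion block: image patch features (queries)
    attend to CLIP text-prompt embeddings (keys/values)."""

    def __init__(self, img_dim=1280, txt_dim=512, fused_dim=256, heads=4):
        super().__init__()
        # Project frozen EfficientNet features and CLIP text embeddings
        # into a shared fused space (dimensions are assumptions).
        self.img_proj = nn.Linear(img_dim, fused_dim)
        self.txt_proj = nn.Linear(txt_dim, fused_dim)
        self.attn = nn.MultiheadAttention(fused_dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(fused_dim)

    def forward(self, img_feats, txt_feats):
        # img_feats: (B, N_patches, img_dim); txt_feats: (B, N_prompts, txt_dim)
        q = self.img_proj(img_feats)
        kv = self.txt_proj(txt_feats)
        fused, _ = self.attn(q, kv, kv)   # prompt-guided cross-attention
        return self.norm(q + fused)       # residual connection + layer norm

if __name__ == "__main__":
    # Only this fusion block (plus a compact reconstruction head, omitted)
    # would be trained; both encoders stay frozen per the abstract.
    fusion = CrossModalFusion()
    img = torch.randn(2, 49, 1280)  # e.g. a flattened 7x7 backbone feature map
    txt = torch.randn(2, 2, 512)    # e.g. "normal"/"abnormal" prompt embeddings
    print(fusion(img, txt).shape)   # torch.Size([2, 49, 256])
```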
Primary Subject Area: Unsupervised Learning and Representation Learning
Secondary Subject Area: Detection and Diagnosis
Registration Requirement: Yes
Reproducibility: Yes, we will publish the code and an instruction guide for reproduction.
Visa & Travel: No
Read CFP & Author Instructions: Yes
Originality Policy: Yes
Single-blind & Not Under Review Elsewhere: Yes
LLM Policy: Yes
Submission Number: 119