TGUMIAD: Text-Guided Unified Model for Medical Image Anomaly Detection

Published: 18 Nov 2025 · Last Modified: 18 Nov 2025 · SPARTA_AAAI2026 Poster · CC BY 4.0
Keywords: Unified Anomaly Detection, Vision–Language Models
TL;DR: A unified vision–language model that uses CLIP-guided prompts to produce interpretable anomaly heatmaps across medical imaging modalities, and that performs well even with few labeled examples.
Abstract: Accurate anomaly detection in medical imaging is critical for clinical decision-making, yet many existing methods rely on disease-specific models and extensive labeled data. We present \textbf{TGUMIAD}, a unified vision--language framework that combines a frozen CLIP image encoder and a frozen CLIP text encoder with explicit cross-modal fusion and a denoising Transformer decoder to deliver robust, interpretable anomaly detection across retina, brain-tumor, and liver-tumor benchmarks. Our design emphasizes \emph{human-in-the-loop use}, \emph{explainability} (prompt-guided heatmaps), and \emph{clinical usability} (compact model size and fast inference). Experiments show strong image- and pixel-level AUROC, especially in few-shot settings, indicating practical value when annotated data are scarce. We also discuss deployment constraints, fairness and robustness under distribution shift, and how our interface supports clinician oversight in real clinical workflows.
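The abstract outlines the pipeline without implementation detail, so the following is a minimal sketch, assuming a frozen Hugging Face CLIP backbone (`openai/clip-vit-base-patch32`), of how the described components could fit together: frozen CLIP image and text encoders, explicit cross-modal fusion via cross-attention, and a small Transformer decoder that scores patch tokens into a coarse anomaly heatmap. All module names, hyperparameters, and the prompt phrasing are illustrative assumptions, not the authors' specification.

```python
# A minimal sketch (assumptions, not the authors' code) of a TGUMIAD-style
# pipeline: frozen CLIP image/text encoders, explicit cross-modal fusion,
# and a small Transformer decoder emitting a prompt-guided anomaly heatmap.
import torch
import torch.nn as nn
from transformers import CLIPModel, CLIPTokenizer

class TextGuidedAnomalyHead(nn.Module):  # hypothetical module name
    def __init__(self, clip_name="openai/clip-vit-base-patch32",
                 d_model=512, n_heads=8, n_layers=2):
        super().__init__()
        self.clip = CLIPModel.from_pretrained(clip_name)
        self.clip.requires_grad_(False)              # both encoders stay frozen
        vis_dim = self.clip.config.vision_config.hidden_size  # 768 for ViT-B/32
        txt_dim = self.clip.config.projection_dim             # 512
        self.vis_proj = nn.Linear(vis_dim, d_model)
        self.txt_proj = nn.Linear(txt_dim, d_model)
        # explicit cross-modal fusion: patch tokens attend to text prompts
        self.fusion = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # small Transformer decoder refining the fused, text-conditioned tokens
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.score = nn.Linear(d_model, 1)           # per-patch anomaly logit

    def forward(self, pixel_values, input_ids, attention_mask):
        with torch.no_grad():
            # patch tokens from the frozen image encoder (drop the CLS token)
            patches = self.clip.vision_model(pixel_values).last_hidden_state[:, 1:]
            # one embedding per text prompt (e.g. "normal" / "abnormal" phrases)
            text = self.clip.get_text_features(input_ids=input_ids,
                                               attention_mask=attention_mask)
        q = self.vis_proj(patches)                                       # (B, P, d)
        t = self.txt_proj(text).unsqueeze(0).expand(q.size(0), -1, -1)  # (B, T, d)
        fused, _ = self.fusion(q, t, t)              # cross-modal fusion
        refined = self.decoder(fused, t)             # text-conditioned decoding
        logits = self.score(refined).squeeze(-1)     # (B, P)
        side = int(logits.size(1) ** 0.5)            # 7x7 patch grid at 224px input
        return logits.view(-1, side, side)           # coarse anomaly heatmap

tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
prompts = tok(["a healthy retina scan", "a retina scan with lesions"],
              padding=True, return_tensors="pt")     # illustrative prompts
model = TextGuidedAnomalyHead()
heatmap = model(torch.randn(1, 3, 224, 224),
                prompts["input_ids"], prompts["attention_mask"])
print(heatmap.shape)  # torch.Size([1, 7, 7])
```

Upsampling the 7×7 patch-level map to the input resolution would yield the prompt-guided heatmap the abstract refers to; the denoising objective and the few-shot training recipe are omitted here, as the abstract does not specify them.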
Submission Number: 13