LLaVA-RadZ: Can Multimodal Large Language Models Effectively Tackle Zero-shot Radiology Recognition?
Keywords: Multimodal Large Language Model, Zero-Shot Learning
Abstract: Recently, Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in visual understanding and reasoning across various vision-language tasks. However, we found that MLLMs cannot effectively process fine-grained medical image data in the traditional Visual Question Answering (VQA) pipeline, as they do not fully exploit the captured features and the available medical knowledge, which results in MLLMs usually performing poorly in zero-shot medical disease recognition. Fortunately, this limitation does not indicate that MLLMs are fundamentally incapable of addressing fine-grained recognition tasks. From a feature representation perspective, MLLMs demonstrate considerable potential for tackling such challenging problems. Thus, to address this challenge, we propose $\textbf{\textit{LLaVA-RadZ}}$, a simple yet effective framework for zero-shot medical disease recognition that utilizes existing MLLM features. Specifically, we design an end-to-end training strategy, termed $\textit{Decoding-Side Feature Alignment Training ($\textbf{DFAT}$)}$, which takes advantage of the characteristics of the MLLM decoder architecture and incorporates modality-specific tokens tailored to different modalities. Additionally, we introduce a $\textit{Domain Knowledge Anchoring Module ($\textbf{DKAM}$)}$ to exploit the intrinsic medical knowledge of large models, which mitigates the $\textit{category semantic gap}$ in image-text alignment.
Extensive experiments demonstrate that our LLaVA-RadZ significantly outperforms traditional MLLMs in zero-shot disease recognition, achieving performance comparable to well-established and highly optimized CLIP-based approaches.
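To make the idea of decoding-side feature alignment with modality-specific tokens concrete, below is a minimal sketch of what such a component might look like. The class name `DecoderSideAlignment`, the hidden and projection dimensions, and the CLIP-style symmetric contrastive loss are illustrative assumptions, not the paper's actual implementation: the sketch only shows how learnable modality-specific tokens appended to each branch of a decoder could be pooled and aligned across modalities.

```python
# Minimal, illustrative sketch (assumed design, not the LLaVA-RadZ implementation):
# modality-specific tokens are appended to the image and text decoder inputs,
# and their final hidden states are aligned with a symmetric contrastive loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DecoderSideAlignment(nn.Module):
    def __init__(self, hidden_dim: int = 4096, proj_dim: int = 512):
        super().__init__()
        # Learnable modality-specific tokens; their final decoder hidden states
        # serve as global summaries of each modality.
        self.image_token = nn.Parameter(torch.randn(1, 1, hidden_dim) * 0.02)
        self.text_token = nn.Parameter(torch.randn(1, 1, hidden_dim) * 0.02)
        self.image_proj = nn.Linear(hidden_dim, proj_dim)
        self.text_proj = nn.Linear(hidden_dim, proj_dim)
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # approx. ln(1/0.07)

    def append_tokens(self, image_embeds, text_embeds):
        # Append one modality-specific token to the end of each input sequence
        # before it is fed through the MLLM decoder.
        b = image_embeds.size(0)
        image_embeds = torch.cat([image_embeds, self.image_token.expand(b, -1, -1)], dim=1)
        text_embeds = torch.cat([text_embeds, self.text_token.expand(b, -1, -1)], dim=1)
        return image_embeds, text_embeds

    def alignment_loss(self, image_hidden, text_hidden):
        # Pool each branch via the final hidden state of its modality-specific
        # token, then align the two with a symmetric InfoNCE-style objective.
        img = F.normalize(self.image_proj(image_hidden[:, -1]), dim=-1)
        txt = F.normalize(self.text_proj(text_hidden[:, -1]), dim=-1)
        logits = self.logit_scale.exp() * img @ txt.t()
        targets = torch.arange(img.size(0), device=img.device)
        return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


# Illustrative usage with dummy decoder hidden states (batch of 4).
module = DecoderSideAlignment()
img_h = torch.randn(4, 17, 4096)  # hypothetical image-branch hidden states
txt_h = torch.randn(4, 9, 4096)   # hypothetical text-branch hidden states
loss = module.alignment_loss(img_h, txt_h)
```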
Primary Area: foundation or frontier models, including LLMs
Submission Number: 9153