Performance Evaluation of Multimodal Large Language Models (LLaVA and GPT-4-based ChatGPT) in Medical Image Classification Tasks
Abstract: Large language models (LLMs) have gained significant attention due to their prospective applications in medicine. Multimodal LLMs can potentially assist clinicians in medical image classification tasks. Evaluating the performance of LLMs in medical image processing is therefore important before they can be integrated into clinical systems. We evaluated two multimodal LLMs (LLaVA and GPT-4-based ChatGPT) against the classic VGG model in tumor classification across brain MRI, breast ultrasound, and kidney CT datasets. Although the LLMs exhibited significant hallucination issues on medical images, prompt engineering markedly improved their performance. With prompt engineering, GPT-4-based ChatGPT achieved 98%, 112%, and 69% of the baseline's accuracy (99%, 107%, and 62% of its F1-score) on the three datasets, respectively. However, privacy, bias, accountability, and transparency concerns necessitate caution. Our study underscores LLMs' potential in medical imaging but emphasizes the need for thorough performance and safety evaluations before their practical application.
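As a minimal illustration of how the relative scores reported above are computed, the following Python sketch expresses a model's metric as a percentage of the baseline's metric. The dataset names match the abstract, but the numeric values are hypothetical placeholders, not the study's actual results.

```python
def relative_performance(model_score: float, baseline_score: float) -> float:
    """Return a model's score expressed as a percentage of the baseline's score."""
    return 100.0 * model_score / baseline_score

# Hypothetical placeholder accuracies -- not the actual results of the study.
datasets = {
    "brain MRI": {"model_acc": 0.88, "baseline_acc": 0.90},
    "breast ultrasound": {"model_acc": 0.78, "baseline_acc": 0.70},
    "kidney CT": {"model_acc": 0.65, "baseline_acc": 0.94},
}

for name, scores in datasets.items():
    rel = relative_performance(scores["model_acc"], scores["baseline_acc"])
    print(f"{name}: {rel:.0f}% of baseline accuracy")
```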