Keywords: small language models, vision language models, retrieval augmented generation, AI in radiology, clinical decision support, generative AI in medical imaging, multimodal AI for healthcare
TL;DR: A training-free retrieval-augmented approach that improves medical image classification by pairing a specialized medical image encoder with few-shot prompting across X-ray, CT, and MRI modalities, leveling the playing field for small language models in healthcare.
Abstract: The rapid advancement of artificial intelligence in healthcare has made automated medical image analysis increasingly crucial for improving diagnostic accuracy. Large Vision Language Models (VLMs) show promise in understanding medical imagery, but their reliance on static training data often leads to outdated or inaccurate information. Current approaches to medical image classification rely on either text-based retrieval or general-purpose image encoders and therefore lack the specialized understanding required for complex medical diagnostics. We address these limitations with a novel training-free retrieval-augmented generation approach that combines a specialized medical image encoder with few-shot learning across multiple imaging modalities (X-ray, CT, and MRI). Our experiments on three diverse medical imaging datasets demonstrate substantial improvements in classification performance, with F1 score gains of up to 142% for state-of-the-art VLMs and 250% for smaller deployable models, while requiring only 3-5 retrieved reference images. This levels the playing field for on-premise clinical deployment of smaller language models.
Submission Number: 32
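The abstract outlines a training-free pipeline: embed the query image with a specialized medical image encoder, retrieve a handful of labeled reference images, and few-shot prompt a VLM with them. The sketch below illustrates that flow under stated assumptions; the encoder, index layout, prompt wording, and helper names are hypothetical and not taken from the submission's code.

```python
# Minimal sketch of a retrieval-augmented few-shot classification pipeline,
# assuming a medical image encoder that maps an image to a 1-D feature vector
# and a pre-embedded bank of labeled reference images. All names are illustrative.
import numpy as np


def retrieve_references(query_vec, reference_vecs, reference_labels, k=5):
    """Return the k labeled references most similar to the query (cosine similarity)."""
    ref = reference_vecs / np.linalg.norm(reference_vecs, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    scores = ref @ q
    top = np.argsort(-scores)[:k]
    return [(int(i), reference_labels[int(i)]) for i in top]


def build_fewshot_prompt(retrieved, class_names):
    """Assemble a few-shot prompt listing retrieved reference labels before the query.

    In practice the reference and query images themselves would be attached to the
    VLM request; here only the textual scaffold is shown.
    """
    shots = "\n".join(f"Reference image {i}: label = {label}" for i, label in retrieved)
    return (
        f"Classify the query image into one of: {', '.join(class_names)}.\n"
        f"{shots}\n"
        "Query image: answer with exactly one label from the list above."
    )


if __name__ == "__main__":
    # Toy example with random embeddings standing in for encoder outputs.
    rng = np.random.default_rng(0)
    bank = rng.normal(size=(100, 512))
    labels = rng.choice(["normal", "pneumonia"], size=100)
    query = rng.normal(size=512)
    shots = retrieve_references(query, bank, labels, k=5)
    print(build_fewshot_prompt(shots, ["normal", "pneumonia"]))
```

Because retrieval and prompting replace any fine-tuning step, the same scaffold can wrap either a large hosted VLM or a smaller on-premise model, which is the deployment scenario the abstract emphasizes.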