Improving Medical Visual Instruction Tuning with Labeled Datasets

Published: 2025 · Last Modified: 08 Jan 2026 · MedAGI 2025 · CC BY-SA 4.0
Abstract: We introduce the Medical Vision Instruction-tuning Dataset (MVID), a large-scale dataset of 558,052 samples designed to enhance the instruction-following capabilities of medical vision-language models (VLMs). Previous vision-language datasets excluded clinical data because they were generated with commercial large language models (LLMs) that do not satisfy privacy and regulatory constraints. Moreover, prior instruction-tuning efforts for medical VLMs relied on figure-caption pairs from PubMed, which cover a broad range of medical topics but lack themes tied to clinical applications, such as knowledge of, detection of, and localization of common findings. To address these limitations, MVID is generated with an open LLM and incorporates a diverse range of labeled data, including classification, segmentation, and free-text tasks, equipping models with essential medical competencies. Our experiments show that medical VLMs trained on MVID outperform those trained on PubMed-based datasets on VQA benchmarks by up to 20.29% in a zero-shot setting. Furthermore, our 7-billion-parameter model outperforms GPT-4o on radiology tasks, highlighting the effectiveness of MVID in developing capable medical VLMs.
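For readers unfamiliar with how labeled datasets are turned into instruction-tuning samples, the sketch below illustrates one plausible conversion step: templating a classification label and a segmentation-derived bounding box into a LLaVA-style conversation record. The record schema, field names, and prompt wording are illustrative assumptions, not MVID's actual pipeline, which additionally uses an open LLM to generate the instructions.

```python
# Illustrative sketch only: the exact prompt templates, field names, and record
# schema of MVID are assumptions made for illustration, not the paper's spec.
import json
from dataclasses import dataclass


@dataclass
class LabeledImage:
    """A hypothetical labeled radiology record (classification + localization)."""
    image_path: str
    findings: list[str]            # classification labels
    boxes: dict[str, list[float]]  # finding -> [x1, y1, x2, y2], e.g. from masks


def to_instruction_sample(rec: LabeledImage) -> dict:
    """Turn one labeled record into a LLaVA-style instruction-tuning conversation.

    Only the deterministic templating step is shown; in practice an open LLM
    would rephrase such templated Q/A pairs into more natural instructions.
    """
    conversations = []

    # Detection-style question derived from the classification labels.
    conversations.append({"from": "human",
                          "value": "<image>\nWhich findings are present in this image?"})
    conversations.append({"from": "gpt",
                          "value": ", ".join(rec.findings) if rec.findings else "No findings."})

    # Localization-style questions derived from segmentation bounding boxes.
    for finding, box in rec.boxes.items():
        conversations.append({"from": "human",
                              "value": f"Where is the {finding} located? Answer with a bounding box."})
        conversations.append({"from": "gpt",
                              "value": "[" + ", ".join(f"{v:.2f}" for v in box) + "]"})

    return {"image": rec.image_path, "conversations": conversations}


if __name__ == "__main__":
    record = LabeledImage(
        image_path="cxr_000123.png",
        findings=["cardiomegaly", "pleural effusion"],
        boxes={"pleural effusion": [0.62, 0.55, 0.93, 0.88]},
    )
    print(json.dumps(to_instruction_sample(record), indent=2))
```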