SemiHVision: Enhancing Medical Multimodal Models with a Semi-Human Annotated Dataset and Fine-Tuned Instruction Generation
Keywords: Medical Multimodal QA
Abstract: Many medical MLLMs achieve strong scores on curated VQA-style benchmarks yet still struggle with real clinical questions, because their training and supervision expose them to too little clinically grounded knowledge and prevailing benchmarks contain too few items that require diagnostic reasoning. We introduce \textbf{SemiHVision}, a semi-human-validated multimodal instruction dataset built with a multimodal retriever; to our knowledge, it is the first dataset to leverage a unified image--text retriever to integrate real-world clinical information into data construction, thereby strengthening models' clinical diagnostic reasoning. Our pipeline retrieves image- and context-relevant evidence and performs retrieval-augmented synthesis to produce clinically grounded instruction Q&A and captions across major modalities (X-ray, CT, MRI, ultrasound, histopathology), while standardizing heterogeneous annotations into a training-ready schema. Fine-tuning on SemiHVision yields \textbf{SemiHVision-8B-AN}, which surpasses public medical models such as HuatuoGPT-Vision-34B (79.0% vs. 66.7%) and private general models such as Claude3-Opus (55.7%) on standard benchmarks (SLAKE, VQA-RAD). On the \textbf{JAMA Clinical Challenge}, a benchmark that directly probes diagnostic reasoning aligned with clinical practice, SemiHVision-8B-AN achieves a GPT-4 rubric score of 1.29, exceeding HuatuoGPT-Vision-34B (1.13) and Claude3-Opus (1.17), demonstrating the effectiveness of the SemiHVision dataset.
Primary Area: datasets and benchmarks
Submission Number: 3574