Keywords: MLLM, Medical, Post-Training
Abstract: Multimodal Large Language Models (MLLMs) have achieved strong performance in general visual understanding and reasoning; however, their progress in the medical domain remains constrained by the scarcity of informative multimodal medical data and the limited effectiveness of Reinforcement Learning with Verifiable Rewards (RLVR). Moreover, existing work often lacks an in-depth exploration of multimodal medical tasks.
To address these issues, during supervised fine-tuning (SFT), we jointly incorporate high-quality textual reasoning data, general multimodal data, and multimodal medical data to enhance foundational medical knowledge while preserving the base model’s reasoning capability. Furthermore, to mitigate sparse-information scenarios common in medical datasets, we synthesize reflective-pattern-injected chain-of-thought (CoT) data in addition to standard CoT, endowing the model with structured reflective reasoning and providing a strong initialization for subsequent RLVR training.
Based on this training paradigm, we introduce the InfiMed-Series, including InfiMed-SFT-3B and InfiMed-RL-3B, which achieve state-of-the-art performance across seven multimodal medical benchmarks. Notably, InfiMed-RL-3B attains an average accuracy of 59.2\%, outperforming larger models such as InternVL3-8B (57.3\%), while using only 188K SFT samples and 36K RLVR samples.
Finally, we conduct extensive experiments to explore a range of fundamental research questions regarding data composition, reasoning strategies, and training paradigms in multimodal medical models. Our findings provide meaningful insights for the future development of medical MLLMs.
Paper Type: Long
Research Area: Clinical and Biomedical Applications
Research Area Keywords: NLP Applications
Languages Studied: English, Chinese
Submission Number: 5768