Keywords: Medical Vision-Language Models; Structured Visual Reasoning; Reinforcement Learning; Self-Corrective Training
Abstract: Reinforcement learning (RL) can improve interpretability in medical vision-language models (VLMs), but medical visual reasoning remains challenging without structured guidance. Existing supervised fine-tuning plus reinforcement learning (SFT+RL) approaches often learn task-specific image-to-answer mappings, misaligning visual evidence with textual reasoning and leading to shortcut reasoning. To address these challenges, we propose MEDSAGE, a medical VLM framework built upon structured reasoning sequences. MEDSAGE introduces a structured path enhancement strategy that formulates medical visual reasoning as a sequence of clinically meaningful stages (localization, visual analysis, knowledge matching, and final decision), thereby guiding models to explore plausible reasoning paths. We construct two training datasets, \textbf{SAGE-sft20K} and \textbf{SAGE-rl10K}, to support this training paradigm. Within this framework, SFT induces consistent structured reasoning across tasks, while self-corrective RL further improves answer correctness by encouraging self-check-guided revision of erroneous predictions during training. Experiments on five medical VQA benchmark datasets show that MEDSAGE achieves competitive or improved performance across diverse tasks. Additional analyses further examine robustness and reasoning faithfulness.
Paper Type: Long
Research Area: Clinical and Biomedical Applications
Research Area Keywords: Medical Vision-Language Models; Structured Visual Reasoning; Reinforcement Learning; Self-Corrective Training
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 5196