Abstract: Recently, reinforcement learning (RL)-based tuning has shifted the trajectory of Multimodal Large Language Models (MLLMs), particularly following the introduction of Group Relative Policy Optimization (GRPO). However, directly applying it to medical tasks remains challenging, as it does not readily yield clinically grounded model behavior. Motivated by the need to align model responses with clinical expectations, we investigate four critical dimensions that affect the effectiveness of RL-based tuning in medical visual question answering (VQA): base model initialization strategy, the role of medical semantic alignment, the impact of length-based rewards on long-chain reasoning, and the influence of bias. We conduct extensive experiments to analyze these factors for medical MLLMs, providing new insights into how such models can be fine-tuned for a specific domain. Our results further demonstrate that GRPO-based RL tuning consistently outperforms standard supervised fine-tuning (SFT) in both accuracy and reasoning quality.
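Since the abstract centers on GRPO-based tuning and length-based rewards, the snippet below is a minimal sketch of the group-relative advantage computation that GRPO uses, combined with an illustrative length-shaping term. The function names, the reward weighting, and the accuracy/length decomposition are assumptions for illustration only, not the paper's implementation.

```python
# Minimal sketch of a GRPO-style group-relative advantage with a
# hypothetical length-based reward term (not the paper's exact setup).
from statistics import mean, pstdev


def length_bonus(response: str, target_len: int = 256, weight: float = 0.1) -> float:
    """Hypothetical length-shaping reward: bonus grows with reasoning length up to a cap."""
    return weight * min(len(response.split()), target_len) / target_len


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO normalizes each sampled response's reward against its own group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: one medical VQA prompt with a group of 4 sampled answers.
answers = [
    "The opacity is consistent with pneumonia because ...",
    "Pneumothorax.",
    "Pneumonia.",
    "The X-ray shows ...",
]
accuracy = [1.0, 0.0, 1.0, 0.0]  # assumed exact-match accuracy reward
rewards = [a + length_bonus(ans) for a, ans in zip(accuracy, answers)]
print(group_relative_advantages(rewards))
```

The group-relative normalization is what lets longer, correct reasoning chains receive higher advantages than terse correct answers when a length term is added, which is the kind of effect the abstract's length-based reward analysis examines.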
Paper Type: Short
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Vision Language Model, fine-tuning, Medical VQA, Reinforcement Learning, Group Relative Policy Optimization
Contribution Types: NLP engineering experiment, Reproduction study, Publicly available software and/or pre-trained models, Surveys
Languages Studied: English
Keywords: Vision Language Model, fine-tuning, Medical VQA, Reinforcement Learning, Group Relative Policy Optimization
Submission Number: 1710