MedCCO: Unleashing Open-Ended Reasoning in Medical Multi-modal Language Models via Curriculum Reinforcement Learning

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Reinforcement Learning, VLM, Medical Reasoning
Abstract: Recent advances in reinforcement learning with verifiable, rule-based rewards have greatly enhanced the reasoning capabilities and out-of-distribution generalization of VLMs/LLMs, obviating the need for manually crafted reasoning chains. Despite these promising developments in the general domain, their translation to medical imaging remains limited. Moreover, current reinforcement fine-tuning (RFT) approaches in medical reasoning are designed primarily for close-ended visual question answering (VQA), where answer choices are provided within the query. This narrow focus limits the model's capacity to leverage world knowledge and adapt to diverse clinical tasks. More importantly, such methods fail to meet the pressing clinical need for open-ended, reasoning-intensive decision-making, which requires generating answers without predefined options—a task that has proven substantially more challenging. To bridge this gap, we propose **MedCCO**, the first multi-modal reinforcement learning framework for medical VQA that integrates both close-ended and open-ended data under a curriculum-based RFT strategy. By explicitly fostering open-ended reasoning, MedCCO aims to enhance performance across both reasoning types. Specifically, MedCCO is first fine-tuned on a diverse set of close-ended medical VQA tasks to establish domain-grounded reasoning capabilities, and is then progressively adapted to open-ended tasks to foster deeper knowledge enhancement and clinical interpretability. We validate MedCCO on eight challenging medical VQA benchmarks spanning both close-ended and open-ended settings. Experimental results show that MedCCO consistently enhances performance and generalization, achieving an 11.4\% accuracy gain across three in-domain tasks and a 5.7\% improvement on five out-of-domain benchmarks. These findings highlight the promise of curriculum-guided RL in advancing robust, clinically relevant reasoning in medical multi-modal language models.
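The abstract contrasts verifiable, rule-based rewards for close-ended VQA (where a predicted option can be checked exactly) with the harder open-ended setting (where no answer choices are given). The paper's actual reward design is not described here, so the sketch below is purely illustrative: it assumes a common RFT convention in which the model wraps its final answer in `<answer>...</answer>` tags, uses exact match as the close-ended reward, and uses token-level F1 as a hypothetical stand-in verifier for open-ended answers.

```python
import re

def extract_answer(completion: str) -> str:
    """Pull the final answer out of a completion that wraps it in
    <answer>...</answer> tags (an assumed RFT output convention,
    not confirmed by the abstract)."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    return match.group(1).strip() if match else ""

def close_ended_reward(completion: str, gold: str) -> float:
    """Binary rule-based reward for multiple-choice VQA:
    1.0 iff the predicted option matches the gold option."""
    return 1.0 if extract_answer(completion).lower() == gold.lower() else 0.0

def open_ended_reward(completion: str, gold: str) -> float:
    """Soft reward for free-form answers: token-level F1 against the
    gold answer -- a hypothetical verifier, since open-ended answers
    admit no single exact-match rule."""
    pred = extract_answer(completion).lower().split()
    ref = gold.lower().split()
    if not pred or not ref:
        return 0.0
    common = sum(min(pred.count(t), ref.count(t)) for t in set(pred))
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Under a curriculum-based RFT strategy such as the one described, training batches would first draw from close-ended tasks scored by `close_ended_reward`, then progressively shift to open-ended tasks scored by the softer verifier.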
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 12283