Elicit and Enhance: Advancing Multimodal Reasoning in Medical Scenarios

Zhongzhen Huang; Linjie Mu; Xiangyu Zhao; Yakun Zhu; Xiaofan Zhang; Shaoting Zhang

11 Sept 2025 (modified: 14 Nov 2025); ICLR 2026 Conference Withdrawn Submission; CC BY 4.0

Keywords: Multimodal Reasoning; Medical Multimodal model

TL;DR: A novel training pipeline and corpus for advancing multimodal reasoning in medical scenarios.

Abstract: Effective clinical decision-making depends on iterative, multimodal reasoning across diverse sources of evidence. Multimodal reasoning models have recently changed how complex tasks are solved. Although such models have achieved notable success in mathematics and science, their application to the medical domain remains underexplored. In this work, we propose MedE$^2$, a two-stage post-training pipeline that elicits and then enhances multimodal reasoning for the medical domain. In Stage-I, we fine-tune models on a small number of text-only samples containing carefully constructed reasoning demonstrations to elicit reasoning behaviors. In Stage-II, we further enhance the model’s reasoning quality using rigorously curated multimodal medical cases, aligning the model’s reasoning outputs with our proposed multimodal medical reasoning preference. Extensive experiments demonstrate the efficacy and reliability of MedE$^2$ in improving the reasoning performance of medical multimodal models. Notably, models trained with MedE$^2$ consistently outperform baselines across multiple medical multimodal benchmarks. Additional validation on larger models and under inference-time scaling further confirms the robustness and practical utility of our approach.

Supplementary Material: pdf

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Submission Number: 4096
