MMedPO: Aligning Medical Vision-Language Models with Clinical-Aware Multimodal Preference Optimization

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: A clinical-aware multimodal preference optimization method for medical vision-language models
Abstract: The advancement of Large Vision-Language Models (LVLMs) has propelled their application in the medical field. However, Medical LVLMs (Med-LVLMs) encounter factuality challenges due to modality misalignment, where the models prioritize textual knowledge over visual input, leading to hallucinations that contradict information in medical images. Previous attempts to enhance modality alignment in Med-LVLMs through preference optimization have inadequately addressed clinical relevance in preference data, making these samples easily distinguishable and reducing alignment effectiveness. In response, we propose MMedPO, a novel multimodal medical preference optimization approach that considers the clinical relevance of preference samples to enhance Med-LVLM alignment. MMedPO curates multimodal preference data by introducing two types of dispreference: (1) plausible hallucinations injected through target Med-LVLMs or GPT-4o to produce medically inaccurate responses, and (2) lesion region neglect achieved through local lesion-noising, disrupting visual understanding of critical areas. We then calculate clinical relevance for each sample based on scores from multiple Med-LLMs and visual tools, enabling effective alignment. Our experiments demonstrate that MMedPO significantly enhances factual accuracy in Med-LVLMs, achieving substantial improvements over existing preference optimization methods by 14.2% and 51.7% on the Med-VQA and report generation tasks, respectively. Our code is available at https://github.com/aiming-lab/MMedPO.
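The abstract describes weighting each preference pair by its clinical-relevance score during optimization. The paper's exact objective is given in the full text; the snippet below is only a minimal sketch, assuming a standard DPO-style loss whose per-pair term is scaled by a relevance weight. The function name `weighted_dpo_loss` and the `clinical_weight` argument are illustrative, not the authors' API.

```python
import torch
import torch.nn.functional as F

def weighted_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                      ref_chosen_logps, ref_rejected_logps,
                      clinical_weight, beta=0.1):
    """DPO-style preference loss with per-sample clinical-relevance weights.

    Each *_logps tensor holds the summed log-probability of the preferred
    (chosen) or dispreferred (rejected) response under the trainable policy
    or the frozen reference model. `clinical_weight` (shape [batch]) scales
    each pair's contribution -- a hypothetical stand-in for the Med-LLM /
    visual-tool relevance scores described in the abstract.
    """
    pi_logratio = policy_chosen_logps - policy_rejected_logps
    ref_logratio = ref_chosen_logps - ref_rejected_logps
    logits = beta * (pi_logratio - ref_logratio)
    per_pair = -F.logsigmoid(logits)            # standard DPO term
    return (clinical_weight * per_pair).mean()  # clinical-relevance weighting

# Toy usage: a batch of 4 preference pairs with relevance scores in [0, 1]
b = 4
loss = weighted_dpo_loss(torch.randn(b), torch.randn(b),
                         torch.randn(b), torch.randn(b),
                         clinical_weight=torch.rand(b))
```

Under this reading, a pair judged clinically unimportant contributes little gradient, so the model is pushed hardest on the medically meaningful distinctions.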
Lay Summary: Modern AI tools that look at both medical images and text—like X-rays and doctors’ notes—are helping improve healthcare. But they still make mistakes. One big problem is that these systems often rely too heavily on text and don’t pay enough attention to the actual medical images. This can lead to errors, like describing a disease that isn’t there. To fix this, we created a new training method called MMedPO. It helps the AI learn to better balance what it sees in the images with what it reads. We do this by giving the AI examples where it makes common medical mistakes, like overlooking a tumor or giving a believable but incorrect answer. Then, we score how medically important each example is, so the AI learns to focus on the right things. Our results show that this method makes AI much more accurate when answering medical questions or writing reports. We’ve also made our tools freely available for others to use.
Link To Code: https://github.com/aiming-lab/MMedPO
Primary Area: Applications->Health / Medicine
Keywords: medical vision-language models, preference optimization
Submission Number: 1468