ExGra-Med: Extended Context Graph Alignment for Medical Vision-Language Models

Published: 18 Sept 2025, Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · CC BY 4.0
Keywords: multi-modal LLM, multi-modal learning, representation learning, healthcare
TL;DR: A new learning scheme for multi-modal LLMs such as LLaVA.
Abstract: State-of-the-art medical multi-modal LLMs (med-MLLMs), such as LLaVA-Med and BioMedGPT, primarily depend on scaling model size and data volume, with training driven largely by autoregressive objectives. However, we reveal that this approach can lead to weak vision-language alignment, making these models overly dependent on costly instruction-following data. To address this, we introduce ExGra-Med, a novel multi-graph alignment framework that jointly aligns images, instruction responses, and extended captions in the latent space, advancing semantic grounding and cross-modal coherence. To scale to large LLMs (e.g., LLaMA-7B), we develop an efficient end-to-end training scheme using black-box gradient estimation, enabling fast and scalable optimization. Empirically, ExGra-Med matches LLaVA-Med's performance using just 10% of pre-training data, achieving a 20.13% gain on VQA-RAD and approaching full-data performance. It also outperforms strong baselines like BioMedGPT and RadFM on visual chatbot and zero-shot classification tasks, demonstrating its promise for efficient, high-quality vision-language integration in medical AI.
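
To make the core idea concrete, the sketch below illustrates one way to jointly align image, instruction-response, and extended-caption embeddings in a shared latent space using pairwise contrastive losses. The function names and the InfoNCE-style formulation are illustrative assumptions for exposition only, not the paper's exact multi-graph alignment objective or its black-box gradient estimation scheme.

```python
# Illustrative sketch only: pairwise InfoNCE-style alignment of image,
# instruction-response, and extended-caption embeddings in a shared latent
# space. Names and loss form are assumptions, not ExGra-Med's exact objective.
import torch
import torch.nn.functional as F


def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss between two batches of embeddings."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                  # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def tri_modal_alignment_loss(img_emb, resp_emb, cap_emb):
    """Pull image, response, and extended-caption embeddings together pairwise."""
    return (info_nce(img_emb, resp_emb) +
            info_nce(img_emb, cap_emb) +
            info_nce(resp_emb, cap_emb)) / 3.0


if __name__ == "__main__":
    B, D = 8, 512                                     # batch size, latent dimension
    img = torch.randn(B, D)                           # stand-ins for encoder outputs
    resp = torch.randn(B, D)
    cap = torch.randn(B, D)
    print(tri_modal_alignment_loss(img, resp, cap))   # scalar alignment loss
```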
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 23285