LoGra-Med: Long-Context Multi-Graph Alignment for Medical Visual-Language Models

26 Sept 2024 (modified: 23 Jan 2025) · ICLR 2025 Conference Withdrawn Submission · CC BY 4.0
Keywords: multi-modal LLM, AI for Healthcare, multi-modal learning
TL;DR: multi-graph alignment algorithm to train medical multi-modal LLM
Abstract: State-of-the-art medical multi-modal large language models (med-MLLMs), such as LLaVA-Med or BiomedGPT, leverage instruction-following data in their pre-training stages. However, these models focus primarily on scaling model size and data volume to boost performance while relying mainly on autoregressive learning objectives. Surprisingly, we reveal that such learning schemes can result in weak alignment between the vision and language modalities, making these models highly reliant on extensive pre-training datasets, a significant challenge in medical domains given the expensive and time-consuming nature of curating high-quality instruction-following instances. We address this challenge with a new multi-graph alignment algorithm, LoGra-Med, which enforces triplet correlations in the latent embedding space among the image modality, conversation-based descriptions, and extended contextual captions. This encourages the model to capture the semantic meaning of the context, handle linguistic variability when captions or questions differ from training instances, and learn cross-modal associations that link visual elements with varied textual interpretations. To scale our algorithm to the med-MLLM setting, we also design an efficient end-to-end learning scheme based on black-box gradient-estimation techniques that permit fast forward and backward passes through the LLM backbone (LLaMA 7B). Empirical results show that we can match the performance of LLaVA-Med pre-trained on 600K image-text pairs from PMC-15M on medical VQA tasks and significantly outperform it when both models are trained on only 10% of the data. For instance, on VQA-RAD, we exceed LLaVA-Med (both trained on 10%) by 20.13% and achieve near parity with the 100% pre-training setting (72.52% vs. 72.64%). We also surpass other SOTA pre-training methods and med-MLLMs, such as BiomedGPT on visual chatbot tasks and RadFM on zero-shot image classification with VQA, showcasing the power of multi-graph alignment in improving vision-language integration for med-MLLMs.
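
To make the alignment idea concrete, the sketch below shows one plausible instantiation of a triplet correlation over the three embedding spaces named in the abstract (image, conversation-based description, extended contextual caption): a symmetric InfoNCE term applied to each of the three pairs and summed. The function names (`info_nce`, `triplet_alignment_loss`), the temperature value, and the pairwise-sum form are illustrative assumptions, not the paper's actual LoGra-Med objective or code.

```python
# Minimal sketch only: the paper's exact multi-graph/triplet objective is not given
# in this abstract. This illustrates a generic pairwise alignment over three
# modality embeddings (image, conversation description, extended caption).
import torch
import torch.nn.functional as F


def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between two batches of embeddings whose rows are paired."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                      # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)    # matching rows are positives
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


def triplet_alignment_loss(img_emb: torch.Tensor,
                           conv_emb: torch.Tensor,
                           cap_emb: torch.Tensor,
                           temperature: float = 0.07) -> torch.Tensor:
    """Sum pairwise alignment terms across the three modality embeddings."""
    return (info_nce(img_emb, conv_emb, temperature)
            + info_nce(img_emb, cap_emb, temperature)
            + info_nce(conv_emb, cap_emb, temperature))


if __name__ == "__main__":
    # Usage with dummy embeddings: batch of 8 triplets, embedding dimension 256.
    B, D = 8, 256
    loss = triplet_alignment_loss(torch.randn(B, D), torch.randn(B, D), torch.randn(B, D))
    print(loss.item())
```

In practice, such a loss would be added to the autoregressive objective during pre-training; the weighting between the two terms and the choice of contrastive formulation are design decisions not specified in this abstract.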
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8037