Confirmation: I have read and agree with the IEEE BHI 2025 conference submission's policy on behalf of myself and my co-authors.
Keywords: Multimodal large language models (MLLMs), medical vision–language understanding, compositional generalization, visual question answering (VQA), MAT triplet schema
TL;DR: We introduce CrossMed, a benchmark to evaluate compositional generalization in MLLMs via MAT triplets, showing strong gains in classification, segmentation, low-data, and cross-task settings using LLaVA-7B and Qwen2-VL models.
Abstract: Recent advances in multimodal large language models (MLLMs) have enabled unified processing of visual and textual inputs, with promising implications for general-purpose medical AI. However, their ability to generalize compositionally across unseen combinations of imaging modality, anatomy, and task type remains underexplored. We introduce CrossMed, a benchmark designed to evaluate compositional generalization (CG) in medical MLLMs using a structured Modality–Anatomy–Task (MAT) schema. CrossMed reformulates four public datasets, CheXpert (X-ray classification), SIIM-ACR (X-ray segmentation), BraTS 2020 (MRI classification and segmentation), and MosMedData (CT classification), into a unified visual question answering (VQA) format, yielding 20,200 multiple-choice QA instances. We evaluate two open-source MLLMs, LLaVA-Vicuna-7B and Qwen2-VL-7B, on both Related and Unrelated MAT splits, as well as a zero-overlap setting where test triplets share no Modality, Anatomy, or Task with the training data. Models trained on Related splits achieve 83.2% classification accuracy and 0.75 segmentation cIoU, while performance drops significantly under Unrelated and zero-overlap conditions, validating the benchmark's difficulty. Furthermore, we show cross-task transfer: segmentation performance improves by +7% cIoU even when the model is trained on classification-only data. Traditional models (ResNet-50, U-Net) benefit modestly, confirming the MAT schema's broad utility, while MLLMs uniquely excel at CG. CrossMed provides a rigorous testbed for evaluating zero-shot, cross-task, and modality-agnostic generalization in medical vision–language models.
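For concreteness, here is a minimal Python sketch of what a MAT triplet and the zero-overlap split condition could look like; the class, field names, and example values are illustrative assumptions based on the abstract, not the benchmark's released interface.

```python
# Hypothetical sketch of a MAT triplet and the zero-overlap split check
# described in the abstract; names and values are assumptions.
from dataclasses import dataclass

@dataclass(frozen=True)
class MATTriplet:
    modality: str  # e.g., "X-ray", "MRI", "CT"
    anatomy: str   # e.g., "chest", "brain", "lung"
    task: str      # e.g., "classification", "segmentation"

def is_zero_overlap(train: set[MATTriplet], test: set[MATTriplet]) -> bool:
    """True if no test triplet shares a Modality, Anatomy, or Task
    with any training triplet (the strictest split in the abstract)."""
    train_modalities = {t.modality for t in train}
    train_anatomies = {t.anatomy for t in train}
    train_tasks = {t.task for t in train}
    return all(
        t.modality not in train_modalities
        and t.anatomy not in train_anatomies
        and t.task not in train_tasks
        for t in test
    )

# Example: X-ray chest classification vs. MRI brain segmentation
# share no axis, so the pair satisfies the zero-overlap condition.
train = {MATTriplet("X-ray", "chest", "classification")}
test = {MATTriplet("MRI", "brain", "segmentation")}
assert is_zero_overlap(train, test)
```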
Track: 7. General Track
Registration Id: J3N9SCVBKNK
Submission Number: 285