CrossMed: A Multimodal Cross-Task Benchmark for Compositional Generalization in Medical Imaging

Published: 19 Aug 2025, Last Modified: 12 Oct 2025 · BHI 2025 · CC BY 4.0
Confirmation: I have read and agree with the IEEE BHI 2025 conference submission's policy on behalf of myself and my co-authors.
Keywords: Multimodal large language models (MLLMs), medical vision–language understanding, compositional generalization, visual question answering (VQA), MAT triplet schema
TL;DR: We introduce CrossMed, a benchmark to evaluate compositional generalization in MLLMs via MAT triplets, showing strong gains in classification, segmentation, low-data, and cross-task settings using LLaVA-7B and Qwen2-VL models.
Abstract: Recent advances in multimodal large language models (MLLMs) have enabled unified processing of visual and textual inputs, with promising implications for general-purpose medical AI. However, their ability to generalize compositionally across unseen combinations of imaging modality, anatomy, and task type remains underexplored. We introduce CrossMed, a benchmark designed to evaluate compositional generalization (CG) in medical MLLMs using a structured Modality–Anatomy–Task (MAT) schema. CrossMed reformulates four public datasets, CheXpert (X-ray classification), SIIM-ACR (X-ray segmentation), BraTS 2020 (MRI classification and segmentation), and MosMedData (CT classification), into a unified visual question answering (VQA) format, yielding 20,200 multiple-choice QA instances. We evaluate two open-source MLLMs, LLaVA-Vicuna-7B and Qwen2-VL-7B, on both Related and Unrelated MAT splits, as well as a zero-overlap setting in which test triplets share no Modality, Anatomy, or Task with the training data. Models trained on Related splits achieve 83.2% classification accuracy and 0.75 segmentation cIoU, while performance drops significantly under the Unrelated and zero-overlap conditions, validating the benchmark's difficulty. Furthermore, we demonstrate cross-task transfer: segmentation performance improves by +7% cIoU even when models are trained on classification-only data. Traditional models (ResNet-50, U-Net) benefit modestly from the MAT formulation, confirming its broad utility, while MLLMs uniquely excel at CG. CrossMed provides a rigorous testbed for evaluating zero-shot, cross-task, and modality-agnostic generalization in medical vision–language models.
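To make the MAT schema and the unified VQA reformulation concrete, the sketch below shows how a triplet, a multiple-choice VQA instance, and a zero-overlap split check might be structured. All class names, field names, and example values are illustrative assumptions, not CrossMed's actual data format.

```python
# A minimal sketch of the MAT (Modality-Anatomy-Task) triplet schema and the
# unified multiple-choice VQA format described in the abstract. Names and
# values are illustrative assumptions, not CrossMed's released schema.
from dataclasses import dataclass, field


@dataclass(frozen=True)
class MATTriplet:
    modality: str  # e.g. "X-ray", "MRI", "CT"
    anatomy: str   # e.g. "chest", "brain", "lung"
    task: str      # e.g. "classification", "segmentation"


@dataclass
class VQAInstance:
    triplet: MATTriplet
    image_path: str
    question: str
    choices: list[str] = field(default_factory=list)
    answer: str = ""  # must be one of `choices`


# Illustrative instance: a CheXpert-style X-ray classification item recast as VQA.
example = VQAInstance(
    triplet=MATTriplet("X-ray", "chest", "classification"),
    image_path="chexpert/patient00001/view1_frontal.jpg",
    question="Does this chest X-ray show evidence of pleural effusion?",
    choices=["A. Yes", "B. No"],
    answer="A. Yes",
)


def zero_overlap(train: set[MATTriplet], test: MATTriplet) -> bool:
    """True if `test` shares no Modality, Anatomy, or Task with any training triplet,
    matching the zero-overlap evaluation setting described above."""
    return all(
        test.modality != t.modality
        and test.anatomy != t.anatomy
        and test.task != t.task
        for t in train
    )
```

Under this reading, the Related/Unrelated splits relax the zero-overlap predicate: a Related test triplet shares at least one MAT component with training, while the zero-overlap setting requires all three to be unseen.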
Track: 7. General Track
Registration Id: J3N9SCVBKNK
Submission Number: 285