Keywords: medical visual question answering, multimodal robustness, fusion analysis, robustness under distribution shift, cross-modal perturbations
TL;DR: A diagnostic robustness framework for Med-VQA, showing that encoder choice strongly affects robustness, cross-modal perturbations dominate failures, and fusion drift correlates with accuracy loss, revealing a trade-off between fusion expressivity and error containment.
Abstract: Medical Visual Question Answering (Med-VQA) systems are increasingly used in clinical decision support, yet their robustness under distribution shift remains poorly understood. Existing evaluations focus on clean accuracy and provide limited insight into why models fail. We introduce a diagnostic framework that decomposes robustness failures into encoder instability, cross-modal error propagation, and fusion-induced error amplification. Across SLAKE, PathVQA, and VQA-RAD, we show that encoder choice alone causes up to 12% variation in calibration and up to 25% variation in robustness drop, motivating principled encoder selection before comparing fusion methods. We then evaluate fusion architectures of varying complexity under visual, textual, cross-modal, and fusion-specific clinical perturbations. Cross-modal perturbations consistently produce the largest accuracy drops and the lowest consistency, highlighting modality misalignment as a dominant failure mode. Fusion representation drift strongly correlates with performance degradation (Spearman's ρ = 0.79), and attention-based fusion increases modality entanglement, revealing an inherent trade-off between fusion expressivity and failure localization. Overall, our results argue for moving beyond accuracy-centric evaluation to improve the reliability of multimodal clinical systems.
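A correlation between fusion representation drift and performance degradation could be computed along the following lines. This is a minimal sketch, not the authors' implementation: the drift measure (one minus cosine similarity between clean and perturbed fused embeddings), the function names, and the toy numbers are all assumptions made for illustration.

```python
import math


def cosine_drift(clean, perturbed):
    """Assumed drift measure: 1 - cosine similarity of two fused embeddings."""
    dot = sum(a * b for a, b in zip(clean, perturbed))
    norm_clean = math.sqrt(sum(a * a for a in clean))
    norm_pert = math.sqrt(sum(b * b for b in perturbed))
    return 1.0 - dot / (norm_clean * norm_pert)


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    return cov / (std_x * std_y)


# Toy, made-up data: one (clean, perturbed) embedding pair per perturbation
# setting, and the accuracy drop observed under that setting.
embedding_pairs = [
    ([1.0, 0.0, 0.0], [0.9, 0.1, 0.0]),
    ([1.0, 0.0, 0.0], [0.6, 0.5, 0.1]),
    ([1.0, 0.0, 0.0], [0.2, 0.8, 0.3]),
]
accuracy_drops = [0.02, 0.09, 0.21]

drifts = [cosine_drift(c, p) for c, p in embedding_pairs]
rho = spearman_rho(drifts, accuracy_drops)
```

Because Spearman's coefficient depends only on ranks, it captures any monotone relationship between drift and accuracy loss and does not require that relationship to be linear.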
Submission Number: 126