Keywords: lvlm, vision language models, self-alignment, robustness, explainability, llm, lmm
TL;DR: A novel technique for self-alignment of Large Vision-Language Models for robust chest X-ray diagnosis
Abstract: Large vision–language models (LVLMs) hold promise for medical imaging but face two critical
challenges: dependence on curated human-annotated datasets for alignment and poor robustness to
real-world perturbations. We show that LVLMs can produce inconsistent outputs between original chest X-rays (CXRs) and WhatsApp-compressed versions that appear visually indistinguishable. Such failures raise serious concerns for mHealth platforms, where compressed or perturbed images are common in real-world diagnostic workflows. Moreover, current LVLMs often attribute lung abnormality predictions to irrelevant regions outside the lungs, a phenomenon we term out-of-lung saliency (OLS), which is exacerbated by compression artifacts. These challenges highlight the urgent need for robust and explainable LVLMs in CXR diagnosis.
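(Illustration, not part of the submission: one plausible way to quantify OLS is the fraction of attribution mass that falls outside a lung segmentation mask. The sketch below assumes a non-negative saliency map and a binary lung mask of the same shape; the function name and scoring rule are hypothetical, not the paper's definition.)

```python
import numpy as np

def out_of_lung_saliency(saliency: np.ndarray, lung_mask: np.ndarray) -> float:
    """Hypothetical OLS score: fraction of attribution mass outside the lung mask.

    saliency  : attribution map, shape (H, W)
    lung_mask : binary lung segmentation, shape (H, W), 1 inside the lungs
    """
    magnitude = np.abs(saliency)           # work with attribution magnitude
    total = magnitude.sum()
    if total == 0:
        return 0.0                          # no attribution mass at all
    outside = magnitude[lung_mask == 0].sum()
    return float(outside / total)           # 1.0 means all saliency lies outside the lungs
```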
To address these issues, we propose Self-CXRAlign, a self-alignment framework that enhances explainability robustness through multi-task learning (MTL)-driven supervised fine-tuning (SFT), enforcing stability of both predictions and attributions across original and perturbed images. Central to our method is the Inter-Task Attribution Conflict Score (TACS), a novel metric that guides the selection of auxiliary tasks to reduce attribution conflicts and mitigate negative transfer. By steering SFT with TACS, Self-CXRAlign achieves up to an 80% reduction in OLS compared to naïve MTL, paving the way for explainable and trustworthy LVLM deployment in mHealth-driven chest X-ray analysis.
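(Illustration, not part of the submission: the abstract does not give the TACS formula. A minimal sketch of one plausible instantiation, assuming per-task attribution maps for the same image, scores conflict as one minus the cosine similarity between primary-task and auxiliary-task attributions, and keeps only low-conflict auxiliary tasks; all names and the threshold are assumptions.)

```python
import numpy as np

def attribution_conflict(primary_attr: np.ndarray, aux_attr: np.ndarray) -> float:
    """Hypothetical inter-task attribution conflict:
    1 - cosine similarity between flattened attribution maps."""
    p = primary_attr.ravel().astype(float)
    a = aux_attr.ravel().astype(float)
    denom = np.linalg.norm(p) * np.linalg.norm(a)
    if denom == 0:
        return 1.0  # treat an empty attribution map as maximal conflict
    return float(1.0 - np.dot(p, a) / denom)

def select_auxiliary_tasks(conflicts: dict[str, float], threshold: float = 0.5) -> list[str]:
    """Keep only auxiliary tasks whose attribution conflict with the primary
    task stays below a threshold, to limit negative transfer."""
    return [task for task, score in conflicts.items() if score < threshold]
```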
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 11895