Multimodal Carotid Risk Stratification with Large Vision-Language Models: Benchmarking, Fine-Tuning, and Clinical Insights

Published: 02 Jun 2026, Last Modified: 02 Jun 2026, Venue: Greeks in AI 2026, License: CC BY 4.0
Keywords: carotid ultrasound, risk stratification, large vision-language models, zero-shot evaluation, model adaptation
Domains: Vision and Learning, Language and Learning, AI for Health
Abstract: Reliable risk assessment for carotid atheromatous disease requires integrating diverse clinical and imaging information in a transparent and interpretable manner. This study investigates state-of-the-art large vision-language models (LVLMs) for multimodal carotid plaque assessment, combining ultrasound imaging with structured clinical, demographic, laboratory, and biomarker data. We propose a framework that simulates realistic diagnostic scenarios through interview-style question sequences and use it to compare open-source LVLMs, including general-purpose and medically tuned models. Zero-shot experiments reveal that while most LVLMs accurately identify imaging modality and anatomy, all perform poorly in risk stratification. To address this, we adapt LLaVA-NeXT-Vicuna using low-rank adaptation (LoRA), which substantially improves stroke risk stratification. Integrating the tabular data as text further improves specificity and balanced accuracy, yielding performance competitive with prior CNN baselines. Our findings highlight both the promise and the limitations of LVLMs in ultrasound-based cardiovascular risk prediction, underscoring the importance of multimodal integration, domain adaptation, and model calibration for clinical translation.
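The abstract mentions two ideas that a small example may make more concrete: passing tabular data to the model as text, and reporting specificity and balanced accuracy. The sketch below is illustrative only and is not the authors' implementation. The function names and the record fields (`age`, `sex`) are hypothetical, and the metrics use the standard definitions for a binary high-risk (1) versus low-risk (0) labelling, which is an assumption about the task setup.

```python
def serialize_record(record):
    """Flatten a dict of tabular fields into one text string for a prompt.

    Field names and ordering are hypothetical; a real pipeline would
    choose its own template.
    """
    return "; ".join(f"{key}: {value}" for key, value in record.items())


def specificity_and_balanced_accuracy(y_true, y_pred):
    """Return (specificity, balanced accuracy) for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    # Balanced accuracy is the mean of sensitivity and specificity.
    return specificity, (sensitivity + specificity) / 2


# Hypothetical usage with made-up values.
prompt_text = serialize_record({"age": 67, "sex": "male"})
spec, bal_acc = specificity_and_balanced_accuracy([1, 1, 0, 0], [1, 0, 0, 0])
```

Balanced accuracy averages the per-class recall, so it is less sensitive than plain accuracy to an imbalance between high-risk and low-risk cases.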
Submission Number: 16