Exploring RAG-driven Multimodal LLMs for Explainable ECG Interpretation
Keywords: Electrocardiography (ECG), arrhythmia classification, foundation models, multimodal learning, retrieval-augmented generation (RAG), interpretability, bias and fairness, clinical AI
TL;DR: We evaluate multimodal RAG frameworks with LLMs for ECG interpretation, showing how metadata, prompts, and retrieval shape performance, and highlighting opportunities and challenges for equitable, scalable clinical systems.
Abstract: Foundation models offer a path toward safer, more generalizable biosignal analysis, yet their role in clinical electrocardiogram (ECG) interpretation remains unclear. We present a multimodal framework with Retrieval-Augmented Generation (RAG) that couples 12-lead ECG waveforms with structured patient metadata (age, sex) to guide large language models (LLMs) and a downstream ECG encoder. Using PTB-XL for benchmarking and MIMIC-IV-ECG for pretraining, we compare four LLMs within the same RAG pipeline (GPT-4, GPT-3.5, Claude 3.5 Sonnet, and Llama 3.2) for arrhythmia classification across five diagnostic superclasses. GPT-4 achieves the best overall performance (AUC = 0.940) and maintains stable accuracy when demographic features are ablated or added, suggesting reduced sensitivity to demographic shifts. Metadata yields the largest relative gains for the lower-performing models in our benchmark, while qualitative analysis highlights remaining failure modes such as hallucinations and retrieval mismatches. These results indicate that RAG-driven multimodal prompting can enhance ECG interpretation and fairness when paired with high-performing foundation models, and we outline practical levers (retrieval design, prompt structure, and metadata integration) for building equitable and scalable clinical ECG systems.
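To make the described pipeline more concrete, the sketch below assembles a retrieval-augmented, metadata-aware prompt for ECG superclass classification. It is a minimal illustration, not the authors' implementation: the toy encoder, cosine-similarity retrieval, and prompt wording are assumptions, and only the five PTB-XL diagnostic superclass labels (NORM, MI, STTC, CD, HYP) reflect the benchmark setting described above.

```python
# Minimal sketch (assumed, not the paper's code): build a RAG-style prompt from a
# 12-lead ECG embedding, a small bank of labelled exemplars, and patient metadata.
import numpy as np

SUPERCLASSES = ["NORM", "MI", "STTC", "CD", "HYP"]  # PTB-XL diagnostic superclasses

def embed_ecg(waveform_12lead: np.ndarray) -> np.ndarray:
    """Placeholder encoder: (12, T) waveform -> (12,) embedding.
    In practice a pretrained ECG encoder (e.g., trained on MIMIC-IV-ECG) would be used."""
    return waveform_12lead.mean(axis=1)

def retrieve(query_emb: np.ndarray, bank_embs: np.ndarray, bank_labels: list, k: int = 3):
    """Return the k most similar labelled exemplars by cosine similarity."""
    sims = bank_embs @ query_emb / (
        np.linalg.norm(bank_embs, axis=1) * np.linalg.norm(query_emb) + 1e-8
    )
    top = np.argsort(-sims)[:k]
    return [(bank_labels[i], float(sims[i])) for i in top]

def build_prompt(metadata: dict, neighbours: list) -> str:
    """Compose a multimodal prompt: patient metadata plus retrieved exemplar labels."""
    lines = [
        "You are assisting with 12-lead ECG interpretation.",
        f"Patient metadata: age={metadata.get('age')}, sex={metadata.get('sex')}.",
        "Most similar reference ECGs (label, similarity):",
    ]
    lines += [f"- {label} (sim={sim:.2f})" for label, sim in neighbours]
    lines.append(f"Classify this ECG into one of {SUPERCLASSES} and explain briefly.")
    return "\n".join(lines)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    query = embed_ecg(rng.standard_normal((12, 1000)))  # new 12-lead recording (toy data)
    bank = rng.standard_normal((5, 12))                  # toy retrieval bank of embeddings
    prompt = build_prompt({"age": 63, "sex": "F"}, retrieve(query, bank, SUPERCLASSES))
    print(prompt)  # this prompt would then be sent to GPT-4, Claude 3.5 Sonnet, Llama 3.2, etc.
```

Swapping the placeholder encoder and retrieval bank for a pretrained ECG encoder and a real exemplar index, and dropping or adding the metadata line, corresponds to the ablations over metadata, prompt structure, and retrieval design discussed in the abstract.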
Submission Number: 63