Can Vision Language Models Track a Heartbeat? A Benchmark on Frame-Level Echocardiogram Understanding
Keywords: Echocardiography, Vision Language Models, Benchmarking
Abstract: Echocardiogram videos are among the most common and clinically vital imaging
modalities in cardiovascular medicine. They capture dynamic cardiac motion, and their accurate
functional assessment requires frame-level temporal precision. Ejection fraction (EF) is
an essential metric for assessing cardiac function and is computed from the left-ventricular
volumes at end-diastole (EDV) and end-systole (ESV), making its estimation inherently
dependent on accurate frame-wise temporal reasoning. Vision Language Models (VLMs) have
recently shown strong performance in general video understanding. However, whether they
can reliably reason over the fine-grained temporal dynamics required for echocardiographic
interpretation remains unclear.
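To make the dependence of EF on the two volumes concrete, the standard definition EF = (EDV - ESV) / EDV can be sketched as follows. The function name and the example volumes are illustrative assumptions, not taken from the benchmark code.

```python
def ejection_fraction(edv_ml: float, esv_ml: float) -> float:
    """Return the ejection fraction in percent.

    edv_ml: left-ventricular volume at end-diastole (mL)
    esv_ml: left-ventricular volume at end-systole (mL)
    """
    if edv_ml <= 0:
        raise ValueError("EDV must be positive")
    if not 0 <= esv_ml <= edv_ml:
        raise ValueError("ESV must lie between 0 and EDV")
    # Fraction of the end-diastolic volume ejected during systole.
    return (edv_ml - esv_ml) / edv_ml * 100.0


if __name__ == "__main__":
    # Illustrative volumes only; not measurements from the paper.
    print(f"EF = {ejection_fraction(120.0, 50.0):.1f}%")
```

Because both volumes are read off specific frames, picking the wrong end-diastolic or end-systolic frame changes the volumes and therefore the EF estimate.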
We benchmarked six state-of-the-art open-source VLMs, Gemma 3n, LLaVA-Interleave,
LLaVA-NeXT-Video 7B/34B, and Qwen3-VL 8B/32B, on the clinically motivated task of
frame-level EDV/ESV localization in apical four-chamber echocardiograms. All models
performed poorly on this localization task, with errors far beyond clinically acceptable
tolerances, and in some cases indistinguishable from random Monte Carlo baselines. To
further test whether explicit structural guidance could compensate for limited temporal
reasoning, we additionally provided left-ventricular segmentation overlays as auxiliary
visual input. However, even with segmentation cues, performance gains remained
negligible on this task. Prompting the models to focus only on the masked areas, omitting
any medical context, likewise did not lead to marked improvements.
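As a rough illustration of what a random Monte Carlo baseline for frame localization can look like, the sketch below estimates the mean absolute frame error of uniform random guessing and compares it with a model's error. The uniform-guess scheme, the function names, and the toy data are assumptions made for illustration; the abstract does not specify how the paper's baseline was constructed.

```python
import random


def random_baseline_mae(num_frames, true_frames, trials=10_000, seed=0):
    """Monte Carlo estimate of the mean absolute frame error of random guessing.

    num_frames[i] is the length of clip i; true_frames[i] is its annotated
    frame index (0-based). Each trial guesses one frame per clip uniformly.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        errors = [abs(rng.randrange(n) - t) for n, t in zip(num_frames, true_frames)]
        total += sum(errors) / len(errors)
    return total / trials


def model_mae(predicted_frames, true_frames):
    """Mean absolute frame error of a model's predicted frame indices."""
    pairs = list(zip(predicted_frames, true_frames))
    return sum(abs(p - t) for p, t in pairs) / len(pairs)


if __name__ == "__main__":
    # Toy data for illustration only; not results from the paper.
    lengths = [60, 48, 72]
    truth = [12, 30, 5]
    predictions = [40, 10, 50]
    print("model MAE:", model_mae(predictions, truth))
    print("random MAE:", random_baseline_mae(lengths, truth))
```

A model whose error is close to the random estimate is, under this assumed baseline, not localizing the target frames better than guessing.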
To reduce the problem to a pure size comparison, we further evaluated a simplified
two-frame binary classification task in which each model must distinguish end-diastole
(ED) from end-systole (ES). Despite this simplification, performance remained low for most
models on the original videos; only Qwen3-VL-32B reached an accuracy of 0.711. Providing
segmentation overlays and omitting medical background knowledge helped only Qwen3-VL,
which reached an accuracy above 0.9 at both sizes, while the other models stayed at chance level.
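One way to judge whether a two-frame ED/ES accuracy differs from chance is an exact two-sided binomial test against 0.5, sketched below. This is an illustrative check with hypothetical function names and toy labels; the abstract does not state which statistical comparison, if any, the authors used.

```python
from math import comb


def accuracy(predictions, labels):
    """Fraction of predictions that match the reference labels."""
    pairs = list(zip(predictions, labels))
    return sum(p == y for p, y in pairs) / len(pairs)


def binomial_p_value_vs_chance(correct, total):
    """Exact two-sided binomial p-value against a chance accuracy of 0.5.

    Sums the probability of every outcome at least as far from total / 2
    as the observed number of correct answers.
    """
    observed = abs(2 * correct - total)
    extreme = sum(comb(total, k) for k in range(total + 1)
                  if abs(2 * k - total) >= observed)
    return extreme / 2 ** total


if __name__ == "__main__":
    # Toy labels for illustration only; not data from the paper.
    preds = ["ED", "ES", "ED", "ES"]
    truth = ["ED", "ES", "ES", "ES"]
    acc = accuracy(preds, truth)
    n_correct = sum(p == y for p, y in zip(preds, truth))
    print("accuracy:", acc)
    print("p-value vs. chance:", binomial_p_value_vs_chance(n_correct, len(truth)))
```

A large p-value means the observed accuracy is compatible with random guessing between the two frames.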
This work presents the first systematic evaluation of general-purpose VLMs on
echocardiogram video analysis across progressively simplified temporal reasoning tasks. Our results
reveal a fundamental limitation of current VLMs in frame-level cardiac ultrasound
interpretation. These findings highlight the importance of medical benchmarks for VLMs and the need
for domain-specific temporal modeling in future medical VLMs. To facilitate benchmarking
of VLMs on echocardiogram video analysis, we make the benchmark and all associated
code publicly available.
Primary Subject Area: Application: Cardiology
Secondary Subject Area: Detection and Diagnosis
Reproducibility: https://github.com/DingmingL/Heartbeat-tracking-benchmark
Submission Number: 33