Presentation Attendance: No, we cannot present in-person
Keywords: Large Language Models, Benchmarks, Time Series Understanding
TL;DR: We propose a benchmark to evaluate large language models' ability to understand patient-monitoring time-series data.
Abstract: Large language models (LLMs) can generate fluent clinical summaries of remote therapeutic monitoring time series, yet it remains unclear whether these narratives faithfully capture clinically significant events such as sustained abnormalities. Existing evaluation metrics emphasize semantic similarity and linguistic quality, leaving event-level correctness largely unmeasured. We introduce an event-based evaluation framework for multimodal time-series summarization using the Technology Integrated Health Management (TIHM) 1.5 dementia monitoring dataset. Clinically grounded daily events are derived via rule-based abnormality thresholds and temporal persistence, and model-generated summaries are aligned to these structured facts. Our protocol measures abnormality recall, duration recall, measurement coverage, and hallucinated event mentions.
Benchmarking zero-shot prompting, statistical prompting, and a vision-based pipeline that uses rendered time-series visualizations reveals a striking decoupling: models with high conventional scores often exhibit near-zero abnormality recall, while the vision-based approach achieves the strongest event alignment (45.7% abnormality recall; 100% duration recall).
These results highlight the need for event-aware evaluation to ensure reliable clinical time-series summarization.
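To make the protocol concrete, here is a minimal Python sketch of the event-derivation and scoring steps described in the abstract. It is an illustration under assumptions, not the paper's implementation: the thresholds, the (day, value) data layout, and the helper names derive_events and score are all hypothetical.

```python
# Minimal sketch of rule-based event derivation (abnormality thresholds plus
# temporal persistence) and event-level scoring. All thresholds, field names,
# and metric definitions here are illustrative assumptions.
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    day: str          # e.g. "2019-05-01"
    measurement: str  # e.g. "heart_rate"
    kind: str         # "high" or "low"

def derive_events(daily_values, thresholds, min_persistence=2):
    """Derive clinically grounded daily events: a measurement must stay outside
    its normal range for at least `min_persistence` consecutive days to count."""
    events = []
    for measurement, series in daily_values.items():
        low, high = thresholds[measurement]
        run = []  # consecutive abnormal (day, kind) pairs
        for day, value in series:
            kind = "high" if value > high else "low" if value < low else None
            if kind:
                run.append((day, kind))
            else:
                if len(run) >= min_persistence:
                    events += [Event(d, measurement, k) for d, k in run]
                run = []
        if len(run) >= min_persistence:  # flush a run ending at the series end
            events += [Event(d, measurement, k) for d, k in run]
    return events

def score(reference_events, mentioned_events):
    """Event-level metrics: abnormality recall over reference events, plus the
    count of hallucinated mentions (mentions with no matching reference)."""
    ref, men = set(reference_events), set(mentioned_events)
    recall = len(ref & men) / len(ref) if ref else 1.0
    hallucinated = len(men - ref)
    return recall, hallucinated
```

In this sketch, abnormality recall is the fraction of reference events that a summary mentions, and any mention without a matching reference event counts as a hallucination; the paper's actual alignment of free-text summaries to structured events is necessarily more involved than exact set matching.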
Track: Research Track (max 4 pages)
Submission Number: 96