Presentation Attendance: No, we cannot present in-person
Keywords: Large Language Models, Benchmarks, Time Series Understanding
TL;DR: We propose a benchmark to evaluate large language models' ability to understand patient-monitoring time-series data.
Abstract: Large language models (LLMs) can generate fluent clinical summaries of remote therapeutic monitoring time series, yet it remains unclear whether these narratives faithfully capture clinically significant events such as sustained abnormalities. Existing evaluation metrics emphasize semantic similarity and linguistic quality, leaving event-level correctness largely unmeasured. We introduce an event-based evaluation framework for multimodal time-series summarization using the Technology Integrated Health Management (TIHM) 1.5 dementia monitoring dataset. Clinically grounded daily events are derived via rule-based abnormality thresholds and temporal persistence, and model-generated summaries are aligned to these structured facts. Our protocol measures abnormality recall, duration recall, measurement coverage, and hallucinated event mentions.
Benchmarking zero-shot prompting, statistical prompting, and a vision-based pipeline that uses rendered time-series visualizations reveals a striking decoupling: models with high conventional scores often exhibit near-zero abnormality recall, while the vision-based approach achieves the strongest event alignment (45.7% abnormality recall; 100% duration recall).
These results highlight the need for event-aware evaluation to ensure reliable clinical time-series summarization.
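To make the protocol concrete, here is a minimal Python sketch of the event-derivation and scoring steps described in the abstract. It is an illustration under assumptions, not the paper's implementation: the thresholds, the (day, value) data layout, and the helper names derive_events and score are all hypothetical.

```python
# Minimal sketch of rule-based event derivation (abnormality thresholds plus
# temporal persistence) and event-level scoring. All thresholds, field names,
# and metric definitions here are illustrative assumptions.
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    day: str          # e.g. "2019-05-01"
    measurement: str  # e.g. "heart_rate"
    kind: str         # "high" or "low"

def derive_events(daily_values, thresholds, min_persistence=2):
    """Derive clinically grounded daily events: a measurement must stay outside
    its normal range for at least `min_persistence` consecutive days to count."""
    events = []
    for measurement, series in daily_values.items():
        low, high = thresholds[measurement]
        run = []  # consecutive abnormal (day, kind) pairs
        for day, value in series:
            kind = "high" if value > high else "low" if value < low else None
            if kind:
                run.append((day, kind))
            else:
                if len(run) >= min_persistence:
                    events += [Event(d, measurement, k) for d, k in run]
                run = []
        if len(run) >= min_persistence:  # flush a run ending at the series end
            events += [Event(d, measurement, k) for d, k in run]
    return events

def score(reference_events, mentioned_events):
    """Event-level metrics: abnormality recall over reference events, plus the
    count of hallucinated mentions (mentions with no matching reference)."""
    ref, men = set(reference_events), set(mentioned_events)
    recall = len(ref & men) / len(ref) if ref else 1.0
    hallucinated = len(men - ref)
    return recall, hallucinated
```

In this sketch, abnormality recall is the fraction of reference events that a summary mentions, and any mention without a matching reference event counts as a hallucination; the paper's actual alignment of free-text summaries to structured events is necessarily more involved than exact set matching.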
Track: Research Track (max 4 pages)
Submission Number: 96