Keywords: Neonatal Heart Rate Time Series, Descriptions, Captioning, Language Models, Vision-Language Models
Abstract: In Neonatal Intensive Care Units (NICUs), heart-rate monitoring produces continuous time-series signals that, combined with clinical metadata, are critical for early warning and decision support. Traditional statistical models cannot effectively incorporate textual inputs, so clinical information goes unused in prediction. Recent advances in multimodal language models (LMs) make it possible to align temporal signals with textual clinical metadata. We propose a two-step framework to test whether combining numerical time series with clinical text yields better predictions. First, we test LMs' ability to recognize and differentiate clinical descriptions tied to temporal and visual properties of NICU heart-rate signals. Second, we evaluate how well this ability transfers to a clinically significant downstream task, 7-day mortality prediction. Results show that descriptive performance strongly correlates with mortality prediction accuracy, and that patient metadata and clinical descriptions improve results, especially for larger models. Vision-Language Models (VLMs) perform best overall, while specialized Time Series Language Models (TSLMs) consistently surpass their base large language models (LLMs). Overall, our work provides (1) a controlled evaluation framework linking time-series understanding to clinically meaningful downstream tasks, (2) quantification of the added value of metadata and descriptions, and (3) evidence that aligning time series with linguistic understanding transfers to high-stakes clinical tasks.
Submission Number: 119