PULSE: Benchmarking Large Language Models for ICU Time Series Classification

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: icu, intensive care unit, clinical, benchmark, large language model, llm, agents, evaluation, classification
TL;DR: The First LLM Benchmark for ICU Time Series
Abstract: Large language models (LLMs) are increasingly applied to multimodal clinical data, yet their performance on high-stakes intensive care unit (ICU) time series remains under-characterized. We introduce PULSE, a comprehensive benchmark evaluating 17 models, spanning conventional machine learning, deep learning, and instruction-following LLMs, across three datasets (HiRID, MIMIC-IV, eICU) and three clinical endpoints (mortality, sepsis, and acute kidney injury). In standard within-domain settings, we find that gradient-boosted decision trees (LightGBM) remain the state of the art, achieving mean AUROCs up to 0.916. Frontier LLMs come close (best mean AUROC of 0.893, OpenAI o3) but are sensitive to the prompting technique. Crucially, while conventional machine learning and deep learning models suffer performance degradation under distribution shift when tested on unseen domains (e.g., XGBoost AUROC dropping to $\approx$0.511 when trained on MIMIC-IV and tested on eICU), zero-shot and few-shot prompting and hybrid-reasoning LLM workflows demonstrate robust performance. This establishes LLMs not merely as reasoning engines but as a pragmatic ``day-zero'' solution for institutions lacking the labeled data required to train conventional models. PULSE provides all code, configuration files, and a public results dashboard to enable transparent, reproducible comparison and rapid community extension. We expect PULSE to serve as a common yardstick in the years to come for developing reliable LLMs for multimodal time series data in critical care.
Primary Area: datasets and benchmarks
Submission Number: 14487