BEDTIME: A Unified Benchmark for Automatically Describing Time Series

Published: 22 Sept 2025, Last Modified: 22 Sept 2025 · WiML @ NeurIPS 2025 · CC BY 4.0
Keywords: Time Series, Descriptions, Captioning, Language Models, Vision-Language Models
Abstract: Many recent works have proposed general-purpose foundation models for a wide range of time series analysis tasks. However, most models are introduced alongside new datasets, leaving few head-to-head comparisons. They also often study complex tasks, making it hard to isolate specific model capabilities. To address these gaps, we formalize and evaluate 3 tasks that test a model's ability to describe time series using language: **(1)** Recognition, **(2)** Differentiation, and **(3)** Generation. We then unify 4 recent datasets to enable head-to-head model comparisons on each task. Experimentally, in evaluating 13 state-of-the-art language, vision--language, and time series--language models, we find that **(1)** popular language-only methods largely underperform, indicating a need for time series-specific architectures, **(2)** VLMs are quite successful, as expected, demonstrating the value of vision models for these tasks, and **(3)** pre-trained multimodal time series--language models outperform LLMs, but still have significant room for improvement. We also find that all approaches exhibit clear fragility in a range of robustness tests. Overall, our benchmark provides a standardized evaluation of a fundamental task, a step toward capable time series reasoning systems.
Submission Number: 31