Keywords: Time Series, Descriptions, Captioning, Language Models, Vision-Language Models
Abstract: Recent works propose complex multi-modal models that handle both time series and language, ultimately claiming high performance on complex tasks like time series reasoning and cross-modal question-answering. However, they skip evaluations of simple but important foundational tasks that complex models should reliably master, and they lack direct, head-to-head comparisons with other popular approaches. So we ask a simple question: *Can recent models even produce generic visual descriptions of time series data?* In response, we propose three new tasks, positing that successful multi-modal models should be able to *recognize*, *differentiate*, and *generate* language descriptions of time series. We then create **BEDTime**, the first benchmark to assess models on each task, comprising four datasets reformatted for these tasks across multiple modalities. Using **BEDTime**, we evaluate 13 state-of-the-art models and find that (1) surprisingly, dedicated time series foundation models severely underperform, despite being designed for similar tasks, (2) vision–language models are quite capable, (3) language-only methods perform worst, despite many lauding their potential, and (4) all approaches are clearly fragile under a range of realistic robustness tests, indicating avenues for future work.
All code and data needed to reproduce our results will be made public.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 19059