Keywords: Time Series, Descriptions, Captioning, Language Models, Vision-Language Models
Abstract: Recent works propose complex multi-modal models that handle both time series and language, ultimately claiming high performance on complex tasks like time series reasoning and cross-modal question-answering. However, they skip evaluations of simple but important foundational tasks that complex models should reliably master, and they lack direct, head-to-head comparisons with other popular approaches. So we ask a simple question: *Can recent models even produce generic visual descriptions of time series data?* In response, we propose three new tasks, positing that successful multi-modal models should be able to *recognize*, *differentiate*, and *generate* language descriptions of time series. We then create **BEDTime**, the first benchmark to assess models on each task, comprising four datasets reformatted for these tasks across multiple modalities. Using **BEDTime**, we evaluate 13 state-of-the-art models and find that (1) surprisingly, dedicated time series foundation models severely underperform, despite being designed for similar tasks, (2) vision–language models are quite capable, (3) language-only methods perform worst, despite many lauding their potential, and (4) all approaches are clearly fragile under a range of realistic robustness tests, indicating avenues for future work.
All code and data needed to reproduce our results will be made public.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 19059