Keywords: Time Series, Question Answering
Abstract: Time series data underpin critical applications across domains such as finance, healthcare, transportation, and environmental science. While recent work has begun to explore multi-task time series question answering (QA), current benchmarks remain limited in scope, with an emphasis largely on forecasting and anomaly detection. We introduce TRQA, a novel time series QA benchmark that substantially broadens task coverage and provides a unified setting for evaluating diverse temporal reasoning abilities. TRQA unifies six diverse tasks under a single framework, organized into two complementary groups: (1) conventional reasoning tasks, including anomaly detection and classification, and (2) advanced reasoning tasks, such as characterization, comparison, data transformation, and temporal relationship reasoning. These tasks span multiple question types, including true-or-false (TF), multiple-choice (MC), and a novel puzzling (PZ) format, enabling a more comprehensive evaluation of diverse aspects of time series reasoning. We curated a large-scale dataset of 210k samples covering 13 diverse domains, 6 tasks, and 3 question types. Each sample consists of one or more time series, an accompanying question, contextual information about the time series, and a corresponding answer. Zero-shot evaluation demonstrates that these tasks are challenging for both commercial and open-source Large Language Models (LLMs). For example, the best-performing commercial LLM, Gemini-2.5-Flash, achieves an average score of only 65.08. While open-source LLMs show notable performance gains after instruction tuning, considerable room for improvement remains: even the best-performing open-source model, LLaMA-3.1-8B, reaches an average score of only 85.26, suggesting that these tasks are non-trivial and pose ongoing challenges for current models.
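For illustration, the abstract's sample description suggests a simple record structure. Below is a minimal Python sketch of how a single TRQA sample might be represented; all field names, the task and question-type labels, and the example values are assumptions for illustration, not taken from the paper or its released data format.

from dataclasses import dataclass

# Hypothetical schema for one TRQA sample, inferred from the abstract:
# one or more time series, a question, contextual information about the
# series, and a corresponding answer. Field names are assumptions.
@dataclass
class TRQASample:
    series: list[list[float]]   # one or more time series
    context: str                # contextual information about the series
    question: str               # the question posed about the series
    question_type: str          # "TF", "MC", or "PZ" (puzzling)
    task: str                   # e.g. "anomaly_detection", "comparison"
    answer: str                 # the expected answer

# Example instance (values are illustrative only).
sample = TRQASample(
    series=[[0.1, 0.3, 0.2, 5.0, 0.2]],
    context="Hourly sensor readings from an industrial pump.",
    question="Does this series contain an anomaly? Answer True or False.",
    question_type="TF",
    task="anomaly_detection",
    answer="True",
)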
Primary Area: datasets and benchmarks
Submission Number: 10613