TRQA: Time Series Reasoning Question And Answering Benchmark

18 Sept 2025 (modified: 02 Dec 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Time Series, Question Answering
Abstract: Time series data underpin critical applications across domains such as finance, healthcare, transportation, and environmental science. While recent work has begun to explore multi-task time series question answering (QA), current benchmarks remain limited in scope, with an emphasis largely on forecasting and anomaly detection tasks. We introduce TRQA, a novel time series QA benchmark that substantially broadens task coverage and provides a unified setting for evaluating diverse temporal reasoning abilities. TRQA unifies six diverse tasks under a single framework, organized into two complementary groups: (1) conventional reasoning tasks, including anomaly detection and classification, and (2) advanced reasoning tasks, such as characterization, comparison, data transformation, and temporal relationship reasoning. These tasks span multiple question types, including true-or-false (TF), multiple-choice (MC), and a novel puzzling (PZ) format, enabling a more comprehensive evaluation of diverse aspects of time series reasoning. We curated a large-scale dataset with 210k samples, covering 13 diverse domains, 6 tasks, and 3 question types. Each sample consists of one or more time series, an accompanying question, contextual information about the time series, and a corresponding answer. Zero-shot evaluation demonstrates that these tasks are challenging for both commercial and open-source Large Language Models (LLMs). For example, the best-performing commercial LLM, Gemini-2.5-Flash, achieves an average score of only 65.08. While open-source LLMs show notable performance gains after instruction tuning, there remains considerable room for improvement. For instance, the best-performing open-source model, LLaMA-3.1-8B, reaches an average score of 85.26, suggesting that these tasks are still non-trivial and pose ongoing challenges for current models.
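To make the per-sample structure described in the abstract concrete, the following is a minimal illustrative sketch of how such a sample could be represented in code. The field names and example values are assumptions for illustration only; they are not the benchmark's actual schema or data.

```python
# Minimal sketch of a TRQA-style sample, assuming hypothetical field names
# (series, question, context, answer, task, question_type) based on the
# abstract's description; not the benchmark's actual schema.
from dataclasses import dataclass
from typing import List


@dataclass
class TRQASample:
    series: List[List[float]]  # one or more time series
    question: str              # natural-language question about the series
    context: str               # contextual information about the time series
    answer: str                # ground-truth answer
    task: str                  # e.g. "anomaly_detection" (assumed label)
    question_type: str         # "TF", "MC", or "PZ"


# Hypothetical example values:
sample = TRQASample(
    series=[[0.1, 0.3, 2.9, 0.2]],
    question="Does this series contain an anomaly?",
    context="Hourly sensor readings from a transportation domain.",
    answer="True",
    task="anomaly_detection",
    question_type="TF",
)
print(sample.question_type)  # -> "TF"
```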
Primary Area: datasets and benchmarks
Submission Number: 10613