TRQA: Time Series Reasoning Question And Answering Benchmark

18 Sept 2025 (modified: 02 Dec 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Time Series, Question Answering
Abstract: Time series data underpin critical applications across domains such as finance, healthcare, transportation, and environmental science. While recent work has begun to explore multi-task time series question answering (QA), current benchmarks remain limited in scope, with an emphasis largely on forecasting and anomaly detection tasks. We introduce TRQA, a novel time series QA benchmark that substantially broadens task coverage and provides a unified setting for evaluating diverse temporal reasoning abilities. TRQA unifies six diverse tasks under a single framework, organized into two complementary groups: (1) conventional reasoning tasks, including anomaly detection and classification, and (2) advanced reasoning tasks, such as characterization, comparison, data transformation, and temporal relationship reasoning. These tasks span multiple question types, including true-or-false (TF), multiple-choice (MC), and a novel puzzling (PZ) format, enabling a more comprehensive evaluation of diverse aspects of time series reasoning. We curated a large-scale dataset with 210k samples, covering 13 diverse domains, 6 tasks, and 3 question types. Each sample consists of one or more time series, an accompanying question, contextual information about the time series, and a corresponding answer. Zero-shot evaluation demonstrates that these tasks are challenging for both commercial and open-source Large Language Models (LLMs). For example, the best-performing commercial LLM, Gemini-2.5-Flash, achieves an average score of only 65.08. While open-source LLMs show notable performance gains after instruction tuning, there remains considerable room for improvement. For instance, the best-performing open-source model, LLaMA-3.1-8B, reaches an average score of 85.26, suggesting that these tasks are still non-trivial and pose ongoing challenges for current models.
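To make the per-sample structure described in the abstract concrete, the following is a minimal illustrative sketch of how such a sample could be represented in code. The field names and example values are assumptions for illustration only; they are not the benchmark's actual schema or data.

```python
# Minimal sketch of a TRQA-style sample, assuming hypothetical field names
# (series, question, context, answer, task, question_type) based on the
# abstract's description; not the benchmark's actual schema.
from dataclasses import dataclass
from typing import List


@dataclass
class TRQASample:
    series: List[List[float]]  # one or more time series
    question: str              # natural-language question about the series
    context: str               # contextual information about the time series
    answer: str                # ground-truth answer
    task: str                  # e.g. "anomaly_detection" (assumed label)
    question_type: str         # "TF", "MC", or "PZ"


# Hypothetical example values:
sample = TRQASample(
    series=[[0.1, 0.3, 2.9, 0.2]],
    question="Does this series contain an anomaly?",
    context="Hourly sensor readings from a transportation domain.",
    answer="True",
    task="anomaly_detection",
    question_type="TF",
)
print(sample.question_type)  # -> "TF"
```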
Primary Area: datasets and benchmarks
Submission Number: 10613