When LLM Meets Time Series: A Real-World Benchmark for Explicit and Implicit Multi-Step Reasoning

Muyan Weng; Jinbo Liu; Defu Cao; Wen Ye; Kexin Zhang; Wei Yang; Yan Liu

20 Sept 2025 (modified: 22 Nov 2025) | ICLR 2026 Conference Withdrawn Submission | CC BY 4.0

Keywords: Time Series Agent, Large Language Models, Benchmarking, Time Series Multi-step reasoning

Abstract: The rapid advancement of Large Language Models (LLMs) has sparked growing interest in their application to time series analysis. Yet, their ability to perform complex reasoning over temporal data remains underexplored. A rigorous benchmark is a crucial first step toward systematic evaluation. In this work, we present the TSAIA Benchmark, a comprehensive framework for assessing LLMs as time-series artificial intelligence assistants. TSAIA integrates two complementary tiers of tasks. The series-centric tier instantiates canonical time-series formulations—such as forecasting, anomaly detection, and risk-return analysis—via a controlled question-generation pipeline, providing continuity with prior evaluation settings. The problem-centric tier, in contrast, derives tasks from real-world analytical questions in healthcare, retail, and climate science, and formalizes their construction through a task-design paradigm spanning three levels: evidence integration, operator-based comparison, and structural multi-step reasoning. This paradigm enables dynamic extensibility, allowing new task instances to be generated as data evolve in practice. To accommodate heterogeneous task types, we define task-specific success criteria and tailored inference quality metrics, applied under a unified evaluation protocol. We evaluate 7 state-of-the-art LLMs and find that while they achieve reasonable performance on series-centric tasks, they struggle substantially on problem-centric ones, often failing at multi-step reasoning, numerical precision, and constraint adherence. These results underscore the need for domain-grounded, dynamically extensible benchmarks as a foundation for advancing LLM-based time-series assistants.

Primary Area: learning on time series and dynamical systems

Submission Number: 24423
