TimeSeriesExamAgent: Creating TimeSeries Reasoning Benchmarks at Scale

Published: 23 Sept 2025, Last Modified: 09 Oct 2025 · BERT2S · CC BY 4.0
Keywords: Time Series Reasoning, Benchmark, Agent, Multimodal Learning
Abstract: We introduce \texttt{TimeSeriesExamAgent}, a scalable, domain-agnostic framework for automatically generating and validating time series reasoning benchmarks. Existing benchmarks do not scale, cover only a few specific domains, and remain labor-intensive to build. Automated benchmark-creation pipelines have been proposed, but they typically rely on a single-step generation process without verification, yielding lower-quality exams. Our framework addresses these limitations by enabling domain experts to easily create high-quality, domain-specific exams from their own datasets: an expert provides a dataset, a natural language description, and a simple data-loading method, and the agent orchestrates the generation pipeline, including question-template creation, multi-perspective robustness verification, and iterative refinement. We demonstrate the framework on three datasets from two diverse domains, healthcare and finance, and evaluate multiple state-of-the-art language models on the exams generated by \texttt{TimeSeriesExamAgent}. Empirically, the framework produces benchmarks whose diversity matches human-generated counterparts, and our evaluation of several large language models shows that accuracy remains limited, underscoring open challenges in time series reasoning.
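The generate/verify/refine loop described in the abstract can be sketched roughly as follows. This is an illustrative mock, not the authors' implementation: the function names, the simulated flaw, and the placeholder checks are all assumptions standing in for the LLM-driven steps.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Question:
    prompt: str
    issues: List[str] = field(default_factory=list)

def generate(template: str, series_name: str) -> Question:
    # Stand-in for LLM-based question generation from a template.
    q = Question(prompt=template.format(name=series_name))
    # Simulate a first-pass flaw that verification should catch.
    q.issues.append("answer not uniquely determined")
    return q

def verify(q: Question) -> List[str]:
    # Stand-in for multi-perspective robustness checks
    # (e.g. answerability, distractor quality, label leakage).
    return list(q.issues)

def refine(q: Question, found: List[str]) -> Question:
    # Stand-in for LLM-based revision that addresses reported issues.
    remaining = [i for i in q.issues if i not in found]
    return Question(prompt=q.prompt + " (refined)", issues=remaining)

def build_exam(templates: List[str], series_name: str,
               max_rounds: int = 3) -> List[Question]:
    # Iterate each template through verification and refinement,
    # keeping only questions that pass all checks.
    exam = []
    for template in templates:
        q = generate(template, series_name)
        for _ in range(max_rounds):
            found = verify(q)
            if not found:
                break
            q = refine(q, found)
        if not verify(q):
            exam.append(q)
    return exam
```

A single refinement round suffices in this mock; in practice the loop would bound the number of LLM calls per question via `max_rounds` and discard questions that never pass verification.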
Submission Number: 23