TimeSeriesGym: A Scalable Benchmark for (Time Series) Machine Learning Engineering Agents

Published: 22 Sept 2025, Last Modified: 22 Sept 2025
Venue: WiML @ NeurIPS 2025
License: CC BY 4.0
Keywords: LLM Agents, Time Series, Scalable Benchmarking, Fine-Grained Evaluation
Abstract: We introduce _TimeSeriesGym_, a scalable benchmarking framework for evaluating Large Language Model (LLM) agents on time series machine learning (ML) engineering tasks. Existing benchmarks lack scalability, focus narrowly on well-structured problems, and rely mainly on outcome-based metrics (e.g., task success rate). Our framework addresses these gaps along two critical dimensions. First, _TimeSeriesGym_ incorporates challenges from diverse sources spanning multiple domains and problem types, targeting both isolated capabilities (e.g., data handling, hyperparameter tuning, research code migration) and their combinations, with tools that enable scalable challenge generation. Second, _TimeSeriesGym_ supports multimodal, skill-based, and holistic evaluation, combining precise quantitative metrics with flexible LLM-based evaluation approaches to balance objective assessment and contextual judgment. Although our initial focus is on time series applications, our framework can be readily extended to other data modalities, broadly enhancing the comprehensiveness and practical utility of agentic AI evaluation. Our experiments demonstrate that even state-of-the-art agents struggle to solve time series ML engineering tasks, highlighting the need for more competent agents and more comprehensive benchmarks to advance LLM-driven agents for solving real-world ML engineering challenges.
Submission Number: 259