TimeSeriesGym: A Scalable Benchmark for (Time Series) Machine Learning Engineering Agents

ICLR 2026 Conference Submission 13403 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: AI Agents, Time Series, Scalable Benchmarking, Fine-Grained Evaluation
TL;DR: We introduce TimeSeriesGym, a general environment to evaluate AI Agents on time series machine learning challenges.
Abstract: We introduce TimeSeriesGym, a scalable benchmarking framework for evaluating Artificial Intelligence (AI) agents on time series machine learning engineering challenges. Existing benchmarks lack scalability, focus narrowly on model building in well-defined settings, and evaluate only a limited set of research artifacts (e.g., CSV submission files). To make AI agent benchmarking more relevant to the practice of machine learning engineering, our framework scales along two critical dimensions. First, recognizing that effective ML engineering requires a broad range of skills, TimeSeriesGym incorporates challenges from diverse sources spanning multiple domains and tasks. We design challenges to evaluate both isolated capabilities (including data handling, understanding research repositories, and code translation) and their combinations. Rather than addressing each challenge independently, we develop tools that support designing multiple challenges at scale. Second, we implement evaluation mechanisms for multiple research artifacts, including submission files, code, and models, using precise numeric measures and _optionally_ LLM-based qualitative assessments. This strategy complements objective evaluation with subjective assessment when appropriate. Although our initial focus is on time series applications, our framework can be readily extended to other data modalities, broadly enhancing the comprehensiveness and practical utility of agentic AI evaluation. We [open-source](https://anonymous.4open.science/r/TimeSeriesGym-9CF6/) our benchmarking framework to facilitate future research on the ML engineering capabilities of AI agents.
Primary Area: datasets and benchmarks
Submission Number: 13403
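To make the abstract's dual evaluation strategy concrete, below is a minimal sketch of how a numeric grade over a submission file can be optionally complemented by an LLM-based qualitative assessment of the agent's code. This is not the actual TimeSeriesGym API: every name here (`grade_submission`, `GradingReport`, `llm_judge`, the `target` column) is hypothetical, and the metric (MAE) is just one plausible choice.

```python
"""A hedged sketch of numeric-plus-LLM grading; names and metric are hypothetical."""
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Optional

import pandas as pd


@dataclass
class GradingReport:
    numeric_score: float              # objective metric, e.g., MAE on held-out targets
    qualitative_notes: Optional[str]  # optional LLM-based assessment of the code


def grade_submission(
    submission_csv: Path,
    answers_csv: Path,
    code_dir: Optional[Path] = None,
    llm_judge: Optional[Callable[[str], str]] = None,
) -> GradingReport:
    """Score a CSV submission numerically; optionally ask an LLM to review the code."""
    pred = pd.read_csv(submission_csv)
    truth = pd.read_csv(answers_csv)

    # Objective evaluation: mean absolute error on a shared "target" column.
    mae = (pred["target"] - truth["target"]).abs().mean()

    # Subjective evaluation (optional): pass the agent's source files to an LLM judge.
    notes = None
    if code_dir is not None and llm_judge is not None:
        code_text = "\n\n".join(p.read_text() for p in sorted(code_dir.glob("*.py")))
        notes = llm_judge(
            "Assess this solution code for quality and correctness:\n" + code_text
        )

    return GradingReport(numeric_score=float(mae), qualitative_notes=notes)
```

Keeping the LLM judge behind an optional callable mirrors the abstract's framing: objective numeric measures always apply, while subjective assessment is layered on only when appropriate.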