LLM Agents Struggle at Time Series Machine Learning Engineering

Published: 09 Jun 2025, Last Modified: 09 Jun 2025 · FMSD @ ICML 2025 · CC BY 4.0
Keywords: Time Series, Agents, Large Language Models, Benchmarking, Machine Learning
TL;DR: Large Language Model (LLM) agents are increasingly used for machine learning (ML) research and engineering tasks, but how well do they handle time series challenges?
Abstract: Large Language Model (LLM) agents are increasingly used for machine learning (ML) research and engineering tasks, but how well do they handle time series challenges? The results of our investigation are not optimistic. The application of agentic AI to time series analytics is not yet mature, and its performance has not been evaluated comprehensively or thoroughly enough to inspire confidence for real-world applications. Existing benchmarks lack scalability, focus narrowly on model building in idealized, well-defined settings, and evaluate only a limited set of research artifacts (such as the CSV result files typically submitted to Kaggle competitions), falling short of assessing other pragmatic competencies of agentic tools, such as data wrangling. Effective ML engineering, whether human- or AI-driven, requires a broad set of diverse skills to competently handle the challenges commonly encountered in practice and to deliver complete solutions. Our experiments demonstrate how state-of-the-art agents struggle to solve time series ML engineering tasks, and how current benchmarks do not challenge them sufficiently. We argue that our community still needs more competent agents and more comprehensive benchmarks to produce LLM-driven ML engineering agents capable of solving real-world time series challenges.
Submission Number: 88