Keywords: Overfitting Crisis, LLM Workflows
TL;DR: The paper links past ML overfitting practices to current LLM evaluation-data misuse and proposes best practices to protect the credibility of AI research.
Abstract: The rapid development of sophisticated Large Language Model (LLM) workflows—including agentic systems, multi-step reasoning pipelines, and tool-integrated approaches—has led to impressive reported performance across various benchmarks. However, we argue that the field is repeating a critical mistake from early machine learning: reporting results on data that has been implicitly used for training or optimization. The complexity of modern LLM workflows obscures the fact that iterative prompt engineering, benchmark-driven development, and workflow refinement constitute a form of training on evaluation data. This position paper draws parallels to historical overfitting practices in ML, documents how current LLM development methodologies systematically conflate training and testing data, and proposes best practices to address this growing methodological crisis before it undermines the credibility of AI research, particularly in scientific applications.
Supplementary Material: zip
Submission Number: 303