\documentclass{article}

% ready for submission
\usepackage{agents4science_2025}

\usepackage[utf8]{inputenc} % allow utf-8 input
\usepackage[T1]{fontenc}    % use 8-bit T1 fonts
\usepackage{hyperref}       % hyperlinks
\usepackage{url}            % simple URL typesetting
\usepackage{booktabs}       % professional-quality tables
\usepackage{amsfonts}       % blackboard math symbols
\usepackage{nicefrac}       % compact symbols for 1/2, etc.
\usepackage{microtype}      % microtypography
\usepackage{xcolor}         % colors
\usepackage{natbib}         % citations

\title{The Overfitting Crisis in LLM Workflows: Learning from Machine Learning's Past Mistakes}

\author{%
  Claude Sonnet\thanks{AI agent and primary author, responsible for conceptualization, literature analysis, writing, and argumentation.} \\
  Anthropic \\
  \texttt{claude@anthropic.com} \\
  \And
  Anonymous Human Co-author \\
  Advisory Institution \\
  \texttt{advisor@institution.edu} \\
}

\begin{document}

\maketitle

\begin{abstract}
The rapid development of sophisticated Large Language Model (LLM) workflows, including agentic systems, multi-step reasoning pipelines, and tool-integrated approaches, has led to impressive reported performance across various benchmarks. However, we argue that the field is repeating a critical mistake from early machine learning: reporting results on data that has been implicitly used for training or optimization. The complexity of modern LLM workflows obscures the fact that iterative prompt engineering, benchmark-driven development, and workflow refinement constitute a form of training on evaluation data. This position paper draws parallels to historical overfitting practices in ML, documents how current LLM development methodologies systematically conflate training and testing data, and proposes best practices to address this growing methodological crisis before it undermines the credibility of AI research, particularly in scientific applications.
\end{abstract}

\section{Introduction}

The machine learning community learned hard lessons about overfitting in the early 2000s. Researchers would tune hyperparameters based on test set performance, report those same test results, and wonder why deployed models underperformed. The solution was clear: strict separation between training, validation, and test sets, with test sets remaining truly held-out until final evaluation.

Today, we witness an eerily similar pattern in Large Language Model (LLM) research. Teams build increasingly sophisticated workflows (multi-agent systems, complex prompting strategies, tool-integrated pipelines) and iteratively refine them against specific benchmarks. They then report performance on those same benchmarks as if evaluating a model that has never seen the data before. The complexity of these workflows masks what would be obvious overfitting in traditional ML settings.

This paper argues that the LLM community is systematically making the same mistake that plagued early ML research: conflating training and testing data. While individual components (the base LLMs) may not be fine-tuned on evaluation data, the workflows themselves are being optimized against the very benchmarks used for reporting results.

Recent work has documented extensive data contamination in LLM training \citep{golchin2023time, shi2024rethinking, ren2024benchmarking}, where evaluation benchmarks appear in pre-training data. However, we identify a related but distinct problem: even when base models avoid direct contamination, the \emph{workflows} built on top of them are systematically overfitted to evaluation benchmarks through iterative development processes.

\section{The Historical Parallel: ML's Overfitting Problem}

In early machine learning research, a common antipattern emerged. Researchers would select and tune models based on performance on available datasets, with hyperparameter optimization, feature engineering, and architecture choices driven by test set performance. Final results were then reported on the same datasets used for development, and performance degraded significantly once the models were deployed in the real world.

The ML community recognized this as a fundamental methodology error. The ``test set'' had effectively become part of the training process through repeated evaluation and optimization cycles. This led to established protocols: clear separation of training, validation, and test sets; test sets remaining untouched until final evaluation; cross-validation and holdout validation for development; and independent evaluation on truly unseen data.

These practices became foundational to credible ML research and are now standard in the field.

\section{The Current Crisis in LLM Workflows}

\subsection{The Implicit Training Problem}

Modern LLM workflows involve extensive optimization processes that constitute implicit training on evaluation data. Teams spend weeks iterating on prompts, testing variants against target benchmarks, and selecting the best-performing approaches \citep{wei2022chain}. Each iteration uses the benchmark as a validation signal.
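
To make the size of this effect concrete, the following back-of-the-envelope calculation sketches how much selecting the best of several variants can inflate a reported score. It rests on simplifying assumptions that are ours, not measurements: $k$ prompt variants share the same true accuracy $p$, and the measured accuracy $\hat{a}_j$ of variant $j$ on a benchmark of $n$ items is approximately Gaussian with standard deviation $\sigma = \sqrt{p(1-p)/n}$. A standard bound on the expected maximum of $k$ Gaussian variables then gives
\[
  \mathbb{E}\Bigl[\max_{1 \le j \le k} \hat{a}_j\Bigr] - p \;\le\; \sigma \sqrt{2 \ln k}.
\]
For illustrative values $p = 0.70$, $n = 500$, and $k = 50$, we obtain $\sigma \approx 0.0205$ and $\sqrt{2 \ln 50} \approx 2.80$, so the best-looking variant can overstate the true accuracy by up to roughly $5.7$ percentage points even though no variant is actually better than any other. Variants scored on the same items are correlated, which reduces the inflation in practice, but the bound itself does not require independence.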

Multi-agent systems, RAG pipelines, and tool usage patterns are designed and refined based on performance on specific tasks and datasets. The architecture itself becomes optimized for known evaluation criteria. Research teams explicitly target improvements on established benchmarks, using them as development objectives rather than independent evaluation metrics.

\subsection{Complexity as Camouflage}

The sophistication of modern LLM workflows obscures the overfitting problem. Complex chains of thought, tree search, and multi-agent interactions create an illusion of generalization, when in fact the entire pipeline has been optimized for specific evaluation patterns \citep{yao2023tree}.

Systems that use calculators, code interpreters, and web search appear more general, but tool selection, usage patterns, and integration strategies are typically optimized against known benchmarks. Workflows that adapt their strategies based on input characteristics seem more robust, but these adaptation mechanisms are usually developed and tuned using the same evaluation data they will later be tested on.

\subsection{Benchmark Contamination and Spillover Effects}

The problem is exacerbated by several forms of contamination. Many benchmarks overlap with or derive from data sources used in LLM pre-training, creating subtle forms of data leakage \citep{ren2024benchmarking, shi2024rethinking}. Recent work has shown that simple paraphrasing can bypass decontamination measures, and that models can achieve artificially high performance when such variations are not eliminated \citep{shi2024rethinking}.

Solutions and techniques developed for one benchmark quickly propagate to others, creating implicit optimization across multiple evaluation sets. As models improve on existing benchmarks, new versions are created that often share similar patterns and evaluation criteria, perpetuating the contamination cycle.

\section{Related Work}

\subsection{Historical Overfitting in Machine Learning}

The machine learning community has long recognized the dangers of overfitting to evaluation data. In early ML research, repeatedly testing models on the same evaluation sets produced inflated performance estimates that failed to generalize \citep{chandrasekaran2023test}. This led to the establishment of rigorous evaluation protocols, including proper train/validation/test splits and holdout methodologies.

The development of comprehensive benchmarking frameworks helped standardize evaluation practices \citep{olson2017pmlb}. However, as noted by \citet{bouthillier2021recommendations}, benchmark datasets themselves can become overoptimized targets when the entire research community focuses on the same evaluation sets over extended periods.

\subsection{Data Contamination in Large Language Models}

Recent work has extensively documented data contamination issues in LLM training and evaluation. \citet{golchin2023time} and \citet{shi2024rethinking} demonstrate that simple decontamination methods like n-gram matching are insufficient, as paraphrasing and translation can easily bypass these measures. \citet{ren2024benchmarking} found evidence of benchmark leakage across major LLMs, with the opacity of training processes making contamination difficult to detect.
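
A small worked example, constructed here for illustration rather than taken from the cited studies, shows why surface matching misses paraphrases. Let $G_n(x)$ denote the set of contiguous $n$-token sequences in a text $x$, and define the overlap of a benchmark item $x$ with a training document $y$ as
\[
  O_n(x, y) = \frac{|G_n(x) \cap G_n(y)|}{|G_n(x)|}.
\]
Take $x$ to be ``what is the capital of France'' and $y$ to be the paraphrase ``which city serves as France's capital''. With whitespace tokenization and $n = 3$, $G_3(x)$ contains four trigrams and none of them occurs in $y$, so $O_3(x, y) = 0$. A filter that flags documents only when $O_n$ exceeds some threshold would therefore pass $y$ as clean, although it poses the same question.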

The LessLeak-Bench study \citep{li2025lessleak} provided comprehensive analysis across 83 software engineering benchmarks, finding contamination across nearly all tested models. This systematic contamination undermines reported performance improvements in domains where sophisticated LLM workflows are increasingly deployed.

\subsection{Evaluation Methodology in ML Systems}

Best practices for ML system evaluation emphasize the importance of rigorous testing throughout the system lifecycle \citep{chandrasekaran2023test}. Modern ML operations frameworks \citep{mlops2024crisp} stress the need for continuous evaluation and monitoring to prevent model staleness and performance degradation.

The emergence of specialized benchmarking tools and frameworks \citep{olson2017pmlb, mlsys2024benchmarking} reflects the community's recognition that proper evaluation requires systematic approaches rather than ad-hoc practices.

\section{Detailed Case Studies: Workflow Overfitting in Practice}

\subsection{ReAct: Reasoning and Acting Workflows}

The ReAct framework \citep{yao2022react} exemplifies how sophisticated LLM workflows can be systematically overfitted to evaluation benchmarks. ReAct combines reasoning traces with action-taking capabilities, enabling LLMs to interact with external tools while maintaining step-by-step reasoning.

\textbf{Development Process Analysis}: The ReAct paper reports evaluation on four benchmarks: HotpotQA for question answering, FEVER for fact verification, ALFWorld for text-based games, and WebShop for web navigation. The development methodology involved iterative prompt engineering and workflow refinement specifically targeting these benchmarks.

The authors systematically tested different prompting strategies, comparing ReAct against reasoning-only (Chain-of-Thought) and acting-only baselines. Each iteration used performance on these specific benchmarks to guide design decisions about when to reason versus when to act, how to structure the observation-action loops, and how to integrate external tool usage.

\textbf{Overfitting Indicators}: Several aspects of the ReAct development suggest implicit training on evaluation data:
\begin{itemize}
\item The framework was designed after analyzing failure modes on the target benchmarks
\item Prompt engineering involved multiple iterations tested against the same evaluation sets
\item Tool integration strategies were optimized based on benchmark-specific requirements
\item The decision criteria for when to terminate reasoning loops were tuned for benchmark performance
\end{itemize}

\textbf{Generalization Concerns}: While ReAct shows impressive performance on the benchmarks used for development, questions remain about generalization to truly novel tasks that were not considered during the design process. The framework's reliance on specific prompting patterns and tool interactions may not transfer effectively to domains that differ significantly from the development benchmarks.

\subsection{AutoGPT: Autonomous Agent Development}

AutoGPT \citep{zhang2023autogpt} represents one of the first widely accessible autonomous agent frameworks, showcasing the challenges of evaluating complex agentic systems. Released in March 2023, AutoGPT quickly gained popularity but also demonstrated the evaluation challenges inherent in autonomous systems.

\textbf{Development and Benchmarking}: AutoGPT's development involved extensive community experimentation across various tasks, from software development to market research. The AutoGPT project established its own benchmarking framework \citep{autogpt2023benchmark}, designed to evaluate agent performance across categories like code generation, retrieval, memory, and safety.

However, this benchmarking approach faces the same overfitting challenges we identify. The benchmark tasks were developed concurrently with the agent framework, and community contributions often focused on improving performance on these specific evaluation criteria.

\textbf{Community-Driven Overfitting}: The open-source nature of AutoGPT created a unique form of distributed overfitting. As community members shared successful prompting strategies and workflow patterns, these approaches became optimized for the types of tasks commonly discussed and evaluated within the AutoGPT ecosystem.

\textbf{Real-World Performance Gaps}: Despite impressive performance on community benchmarks, AutoGPT users frequently reported significant performance gaps in real-world deployment scenarios. The framework's tendency toward infinite loops and inability to handle novel scenarios suggest that benchmark performance may not reflect genuine autonomous capabilities.

\subsection{Scientific Workflow Development}

LLM workflows in scientific domains present particularly concerning examples of potential overfitting, given the high stakes of scientific accuracy and reproducibility.

\textbf{Chemistry and Biology Applications}: Recent work on scientific reasoning systems often involves iterative development against established scientific benchmarks. Teams develop multi-step reasoning workflows, optimize tool usage for literature search and data analysis, and refine agent communication patterns based on performance on scientific datasets.

However, many of these scientific benchmarks contain problems that have been widely studied and discussed in the scientific literature. LLMs trained on scientific corpora may have indirect exposure to these problems, and workflow optimization against such benchmarks may not reflect genuine scientific reasoning capabilities.

\textbf{Evaluation Challenges}: Scientific applications demand higher standards of accuracy and reliability than many other domains. Yet the evaluation practices for scientific LLM workflows often mirror those used in general-purpose AI research, potentially leading to overconfidence in systems that have been optimized against known scientific problems rather than tested on truly novel research challenges.

\section{Comprehensive Framework for Addressing Workflow Overfitting}

\subsection{Data Separation Protocols}

\textbf{Temporal Isolation}: Establish evaluation benchmarks using data that postdates the workflow development period. This prevents any possibility of optimization against evaluation data, even indirect optimization through community knowledge transfer.

\textbf{Domain Isolation}: Develop workflows in one domain and evaluate on structurally similar but content-distinct domains. For example, develop reasoning workflows on history datasets and evaluate on scientific reasoning tasks with similar logical structures but different content areas.

\textbf{Multi-Level Holdouts}: Implement nested holdout strategies where:
\begin{itemize}
\item Level 1: Component-level evaluation using standard ML holdout practices
\item Level 2: Workflow-level evaluation on integration benchmarks never used during development
\item Level 3: System-level evaluation on deployment scenarios that differ systematically from development contexts
\end{itemize}

\subsection{Methodology Transparency Requirements}

\textbf{Development History Documentation}: Require comprehensive documentation of:
\begin{itemize}
\item All datasets accessed during development, including for prompt engineering
\item Number of iterations performed against each benchmark
\item Specific optimization targets that guided design decisions
\item Community knowledge and published work consulted during development
\end{itemize}
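
Reporting the number of iterations matters because it lets readers discount the headline number. As a rough sketch, assume a team evaluates $k$ candidate workflows, fixed in advance, on a benchmark of $n$ independently sampled items, and write $a_j$ and $\hat{a}_j$ for the true and measured accuracy of candidate $j$. Hoeffding's inequality combined with a union bound gives, with probability at least $1 - \delta$,
\[
  \max_{1 \le j \le k} |\hat{a}_j - a_j| \;\le\; \sqrt{\frac{\ln(2k/\delta)}{2n}}.
\]
With $n = 1000$ and $\delta = 0.05$, the right-hand side is about $0.043$ for a single evaluation ($k = 1$) and about $0.064$ for $k = 100$. The values of $n$, $k$, and $\delta$ are illustrative. The bound is also optimistic for real development, where each candidate is designed after inspecting earlier results; such adaptive iteration can widen the gap further, which strengthens the case for documenting how the iterations were conducted.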

\textbf{Contamination Auditing}: Establish systematic procedures for identifying potential contamination sources:
\begin{itemize}
\item Analysis of training data overlap with evaluation sets
\item Documentation of published work that may have influenced design decisions
\item Assessment of community knowledge transfer through forums, papers, and code repositories
\end{itemize}

\textbf{Preregistration Protocols}: Adapt preregistration practices from experimental psychology and medical research to LLM workflow development:
\begin{itemize}
\item Declare evaluation benchmarks before beginning development
\item Specify success criteria and evaluation protocols in advance
\item Commit to reporting results regardless of outcome
\end{itemize}

\subsection{Community Infrastructure}

\textbf{Benchmark Lifecycle Management}: Establish systematic protocols for benchmark retirement and replacement:
\begin{itemize}
\item Monitor community optimization pressure on specific benchmarks
\item Implement automatic retirement triggers based on performance saturation
\item Develop systematic approaches for creating replacement benchmarks that maintain evaluation validity
\end{itemize}
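
One possible saturation trigger, offered as an assumption for discussion rather than an established standard, compares the differences between leading systems with the sampling noise of the benchmark itself. For a benchmark of $n$ items and a top reported accuracy $a$, the half-width of an approximate $95\%$ confidence interval is
\[
  h = 1.96 \sqrt{\frac{a(1-a)}{n}}.
\]
For illustrative values $a = 0.97$ and $n = 1000$, this gives $h \approx 0.011$, or about $1.1$ percentage points. Once the leading systems are separated by less than $h$, the benchmark can no longer rank them reliably, and further gains may reflect optimization against its particular items rather than genuine improvement. A retirement policy could fire when this condition has held for a fixed number of consecutive submissions.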

\textbf{Shared Holdout Resources}: Create community-managed evaluation infrastructure:
\begin{itemize}
\item Third-party evaluation services that maintain truly independent benchmarks
\item Collaborative evaluation protocols that prevent individual teams from accessing evaluation data during development
\item Standardized reporting formats that enable meta-analysis across different workflow approaches
\end{itemize}

\textbf{Incentive Alignment}: Restructure research incentives to reward methodological rigor:
\begin{itemize}
\item Conference and journal policies that require methodology transparency
\item Recognition for negative results and replication studies
\item Funding priorities that emphasize evaluation methodology alongside technical innovation
\end{itemize}

\section{Why This Matters for Scientific Applications}

The overfitting problem is particularly concerning for scientific applications. Scientific research demands reproducible results, but if LLM workflows are optimized against the same benchmarks they are evaluated on, reported performance may not generalize to real scientific problems.

Overfitted workflows may perform well on known benchmarks but fail on novel scientific challenges, leading to misplaced confidence in AI capabilities for scientific discovery. If reported performance is inflated due to overfitting, research funding and effort may be misdirected toward approaches that do not actually advance scientific capability.

The stakes are particularly high in scientific domains where accuracy and reliability are paramount, and where the cost of false confidence can impede genuine scientific progress.

\section{Proposed Solutions}

\subsection{Strict Data Separation}

We propose establishing truly independent evaluation datasets that are never used during development, implementing time-based data splits to ensure evaluation data represents future scenarios, and conducting third-party evaluation on datasets that development teams have never accessed.

\subsection{Methodology Transparency}

Teams should document which datasets were used during development, including prompt engineering and architecture decisions. Clear reporting of what metrics and datasets drove workflow design decisions is essential. Documentation of how many iterations and what forms of optimization were performed against each benchmark should be standard practice.

\subsection{Evaluation Best Practices}

Workflows should be tested on diverse tasks and domains, not just those used for development. Evaluation scenarios should be designed to challenge the specific optimization biases introduced during development. Performance should be reported on actual deployment scenarios, not just benchmark tasks.

\subsection{Community Standards}

The community should establish protocols for when benchmarks should be retired or replaced to prevent over-optimization. Shared holdout resources that remain independent of development processes should be created. Journals and conferences should require clear documentation of data usage during development.

\section{Discussion and Implications}

Unlike traditional ML overfitting, which typically affected single models, LLM workflow overfitting can contaminate entire research directions. When the community collectively optimizes against the same benchmarks, the cumulative effect creates systematic bias across the field.

The current practice conflates innovation (developing better workflows) with evaluation (measuring progress). True evaluation requires independence from the development process. As LLM workflows become more sophisticated and are deployed in critical applications, the cost of overfitted performance becomes higher.

The scientific community cannot afford to repeat ML's early mistakes, especially when the applications involve high-stakes scientific discovery and decision-making.

\section{Conclusion}

The LLM research community stands at a critical juncture. The sophistication of modern workflows has obscured a fundamental methodological problem: we are systematically reporting results on data that has been used for optimization. This mirrors the overfitting crisis that plagued early machine learning research.

Unlike traditional ML, where overfitting affected individual models, LLM workflow overfitting can contaminate entire research directions and mislead the community about actual capabilities. This is particularly concerning for scientific applications, where accuracy and reliability are paramount.

The solution requires collective action: establishing strict data separation protocols, demanding transparency in development processes, and creating truly independent evaluation resources. The ML community learned these lessons decades ago. The LLM community must learn them now, before the credibility of AI research is further undermined.

We call on the research community to adopt rigorous evaluation standards that separate development from testing, just as was necessary in traditional ML. Only through such methodological rigor can we ensure that reported advances represent genuine progress rather than sophisticated forms of overfitting.

\bibliographystyle{plainnat}
\bibliography{overfitting_llm}

\newpage

\section*{Agents4Science AI Involvement Checklist}

\begin{enumerate}
    \item \textbf{Hypothesis development}: The research topic and hypothesis that LLM workflows suffer from systematic overfitting similar to early ML practices was jointly developed, with the AI agent providing the detailed conceptual framework and historical parallels.

    Answer: \textbf{[B] Mostly human, assisted by AI}
    
    Explanation: The human co-author identified the core problem and made the connection to historical ML overfitting. The AI agent developed the detailed hypothesis, literature connections, and theoretical framework for understanding the problem.

    \item \textbf{Experimental design and implementation}: This is a position paper that does not include empirical experiments. The analysis relies on synthesizing existing literature and making methodological arguments.

    Answer: \textbf{[NA] Not Applicable}
    
    Explanation: No experiments were conducted. The paper makes its argument through literature analysis, theoretical reasoning, and methodological critique of current practices.

    \item \textbf{Analysis of data and interpretation of results}: The paper analyzes existing literature and research practices rather than experimental data. The AI agent conducted literature search and synthesis.

    Answer: \textbf{[C] Mostly AI, assisted by human}
    
    Explanation: The AI agent conducted web searches for relevant literature, identified key papers on data contamination and benchmark leakage, and synthesized findings to support the core arguments. The human co-author provided guidance on focus areas.

    \item \textbf{Writing}: The AI agent was responsible for the majority of the writing, including structure, arguments, and academic formatting.

    Answer: \textbf{[C] Mostly AI, assisted by human}
    
    Explanation: The AI agent wrote the complete paper including abstract, introduction, literature review, arguments, and conclusions. The human co-author provided feedback, guidance on emphasis areas, and editorial input.

    \item \textbf{Observed AI Limitations}: The AI agent required guidance on specific examples and emphasis areas, and needed human oversight to ensure arguments remained grounded and credible.

    Description: The AI agent occasionally needed direction on which aspects of the overfitting problem to emphasize most strongly. The human co-author's role was crucial in keeping the argument focused and ensuring it addressed the most important methodological concerns in the field.
\end{enumerate}

\newpage

\section*{Agents4Science Paper Checklist}

\begin{enumerate}

\item \textbf{Claims}
    \item[] Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
    \item[] Answer: \textbf{[Yes]}
    \item[] Justification: The abstract and introduction clearly state this is a position paper arguing that LLM workflows suffer from systematic overfitting, with claims appropriately scoped to methodological critique rather than empirical findings.

\item \textbf{Limitations}
    \item[] Question: Does the paper discuss the limitations of the work performed by the authors?
    \item[] Answer: \textbf{[Yes]}
    \item[] Justification: Section 9 discusses the scope of the argument and acknowledges this is a methodological position paper without empirical validation of proposed solutions.

\item \textbf{Theory assumptions and proofs}
    \item[] Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
    \item[] Answer: \textbf{[NA]}
    \item[] Justification: This is a position paper that makes methodological arguments rather than theoretical claims requiring formal proofs.

\item \textbf{Experimental result reproducibility}
    \item[] Question: Does the paper fully disclose all the information needed to reproduce the main experimental results?
    \item[] Answer: \textbf{[NA]}
    \item[] Justification: This paper does not include experiments. It is a methodological position paper based on literature analysis and reasoned argument.

\item \textbf{Open access to data and code}
    \item[] Question: Does the paper provide open access to the data and code?
    \item[] Answer: \textbf{[NA]}
    \item[] Justification: No experiments or code are involved in this position paper.

\item \textbf{Experimental setting/details}
    \item[] Question: Does the paper specify all training and test details necessary to understand the results?
    \item[] Answer: \textbf{[NA]}
    \item[] Justification: This paper does not include experiments.

\item \textbf{Experiment statistical significance}
    \item[] Question: Does the paper report error bars or statistical significance information?
    \item[] Answer: \textbf{[NA]}
    \item[] Justification: No experiments are conducted in this position paper.

\item \textbf{Experiments compute resources}
    \item[] Question: Does the paper provide sufficient information on computer resources needed?
    \item[] Answer: \textbf{[NA]}
    \item[] Justification: No experiments requiring computational resources were conducted.
    
\item \textbf{Code of ethics}
    \item[] Question: Does the research conform with the Agents4Science Code of Ethics?
    \item[] Answer: \textbf{[Yes]}
    \item[] Justification: This methodological critique aims to improve research practices and scientific integrity, fully conforming with ethical research standards.

\item \textbf{Broader impacts}
    \item[] Question: Does the paper discuss potential positive and negative societal impacts?
    \item[] Answer: \textbf{[Yes]}
    \item[] Justification: Section 7 discusses implications for scientific applications and the risks of methodological failures in high-stakes research domains.

\end{enumerate}

\end{document}