\documentclass{article}

% ready for submission
\usepackage{agents4science_2025}

\usepackage[utf8]{inputenc} % allow utf-8 input
\usepackage[T1]{fontenc}    % use 8-bit T1 fonts
\usepackage{hyperref}       % hyperlinks
\usepackage{url}            % simple URL typesetting
\usepackage{booktabs}       % professional-quality tables
\usepackage{amsfonts}       % blackboard math symbols
\usepackage{nicefrac}       % compact symbols for 1/2, etc.
\usepackage{microtype}      % microtypography
\usepackage{xcolor}         % colors
\usepackage{natbib}         % citations

\title{The Overfitting Crisis in LLM Workflows: Learning from Machine Learning's Past Mistakes}

\author{%
  Claude Sonnet\thanks{AI Agent - Primary author responsible for conceptualization, literature analysis, writing, and argumentation.} \\
  Anthropic \\
  \texttt{claude@anthropic.com} \\
  \And
  Anonymous Human Co-author \\
  Advisory Institution \\
  \texttt{advisor@institution.edu} \\
}

\begin{document}

\maketitle

\begin{abstract}
The rapid development of sophisticated Large Language Model (LLM) workflows—including agentic systems, multi-step reasoning pipelines, and tool-integrated approaches—has led to impressive reported performance across various benchmarks. However, we argue that the field is repeating a critical mistake from early machine learning: reporting results on data that has been implicitly used for training or optimization. The complexity of modern LLM workflows obscures the fact that iterative prompt engineering, benchmark-driven development, and workflow refinement constitute a form of training on evaluation data. This position paper draws parallels to historical overfitting practices in ML, documents how current LLM development methodologies systematically conflate training and testing data, and proposes best practices to address this growing methodological crisis before it undermines the credibility of AI research, particularly in scientific applications.
\end{abstract}

\section{Introduction}

The rapid development of sophisticated Large Language Model (LLM) workflows—including agentic systems, multi-step reasoning pipelines, and tool-integrated approaches—has led to impressive reported performance across various benchmarks. However, we argue that the field is repeating a critical mistake from early machine learning: reporting results on data that has been implicitly used for training or optimization.

The complexity of modern LLM workflows obscures the fact that iterative prompt engineering, benchmark-driven development, and workflow refinement constitute a form of training on evaluation data. This position paper draws parallels to historical overfitting practices in ML, documents how current LLM development methodologies systematically conflate training and testing data, and proposes best practices to address this growing methodological crisis before it undermines the credibility of AI research, particularly in scientific applications.

\section{The Historical Parallel: ML's Overfitting Problem}

In early machine learning research, a common antipattern emerged. Researchers selected and tuned models based on performance on the available datasets, with hyperparameter optimization, feature engineering, and architecture choices driven by test set performance. Final results were then reported on the same datasets used for development, and deployed systems often performed significantly worse than the published numbers suggested.

The ML community recognized this as a fundamental methodological error. The ``test set'' had effectively become part of the training process through repeated evaluation and optimization cycles. This led to established protocols: clear separation of training, validation, and test sets; test sets remaining untouched until final evaluation; cross-validation and holdout validation for development; and independent evaluation on truly unseen data.

These practices became foundational to credible ML research and are now standard in the field.

\section{The Current Crisis in LLM Workflows}

\subsection{The Implicit Training Problem}

Modern LLM workflows involve extensive optimization processes that constitute implicit training on evaluation data. Teams spend weeks iterating on prompts, testing variants against target benchmarks, and selecting the best-performing approaches \citep{wei2022chain}. Each iteration uses the benchmark as a validation signal.
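
A back-of-the-envelope bound makes the possible size of this effect concrete. The notation is ours and purely illustrative: suppose a team evaluates $k$ prompt variants on a benchmark of $n$ items drawn independently from the task distribution, and variant $i$ has true accuracy $p_i$ and observed accuracy $\hat{p}_i$. If the $k$ variants are fixed before any scores are seen, Hoeffding's inequality combined with the standard maximal inequality for sub-Gaussian variables gives
\[
  \mathbb{E}\Bigl[\max_{1 \le i \le k} \bigl(\hat{p}_i - p_i\bigr)\Bigr] \;\le\; \sqrt{\frac{\ln k}{2n}} .
\]
For the hypothetical values $n = 164$ and $k = 50$, the right-hand side is about $0.109$, so picking the best-scoring variant can by itself account for an apparent gain of up to roughly eleven percentage points. This is a worst-case ceiling, not an estimate of the inflation in any particular system, and it does not cover adaptive iteration, in which each new variant is designed after seeing earlier scores.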

Multi-agent systems, RAG pipelines, and tool usage patterns are designed and refined based on performance on specific tasks and datasets. The architecture itself becomes optimized for known evaluation criteria. Research teams explicitly target improvements on established benchmarks, using them as development objectives rather than independent evaluation metrics.

\subsection{Complexity as Camouflage}

The sophistication of modern LLM workflows obscures the overfitting problem. Complex chains of thought, tree search, and multi-agent interactions create an illusion of generalization, when in fact the entire pipeline has been optimized for specific evaluation patterns \citep{yao2023tree}.

Systems that use calculators, code interpreters, and web search appear more general, but tool selection, usage patterns, and integration strategies are typically optimized against known benchmarks. Workflows that adapt their strategies based on input characteristics seem more robust, but these adaptation mechanisms are usually developed and tuned using the same evaluation data they will later be tested on.

\subsection{Benchmark Contamination and Spillover Effects}

The problem is exacerbated by several forms of contamination. Many benchmarks overlap with or derive from data sources used in LLM pre-training, creating subtle forms of data leakage \citep{ren2024benchmarking, shi2024rethinking}. Recent work has shown that simple paraphrasing can bypass decontamination measures, and that models can achieve artificially high performance when such variations are not eliminated \citep{shi2024rethinking}.

Solutions and techniques developed for one benchmark quickly propagate to others, creating implicit optimization across multiple evaluation sets \citep{rogers2021changing, dodge2021documenting}. As models improve on existing benchmarks, new versions are created that often share similar patterns and evaluation criteria, perpetuating the contamination cycle \citep{ethayarajh2022understanding}.

\section{Related Work}

\subsection{Historical Overfitting in Machine Learning}

The machine learning community has long recognized the dangers of overfitting to evaluation data. Early ML research suffered from researchers repeatedly testing models on the same evaluation sets, leading to inflated performance estimates that failed to generalize \citep{chandrasekaran2023test}. This led to the establishment of rigorous evaluation protocols including proper train/validation/test splits and holdout methodologies.

The development of comprehensive benchmarking frameworks helped standardize evaluation practices \citep{olson2017pmlb}. However, as noted by \citet{bouthillier2021recommendations}, benchmark datasets themselves can become overoptimized targets when the entire research community focuses on the same evaluation sets over extended periods.

\subsection{Data Contamination in Large Language Models}

Recent work has extensively documented data contamination issues in LLM training and evaluation. \citet{golchin2023time} and \citet{shi2024rethinking} demonstrate that simple decontamination methods like n-gram matching are insufficient, as paraphrasing and translation can easily bypass these measures. \citet{ren2024benchmarking} found evidence of benchmark leakage across major LLMs, with the opacity of training processes making contamination difficult to detect.
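
For reference, an n-gram matching check can be written as follows. The notation and the threshold are illustrative assumptions, not the procedure of any cited paper. Let $G_n(x)$ denote the set of word $n$-grams of a text $x$, let $\mathcal{D}$ be the training corpus, and let $\tau \in (0, 1]$ be a chosen threshold. A benchmark item $b$ is flagged as contaminated when
\[
  \max_{d \in \mathcal{D}} \frac{|G_n(b) \cap G_n(d)|}{|G_n(b)|} \;\ge\; \tau .
\]
A paraphrase or translation of $b$ preserves its meaning while replacing most of its surface $n$-grams, so the ratio falls below $\tau$ and the item passes the filter even though its content was seen during training.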

The LessLeak-Bench study \citep{zhou2025lessleak} provided a comprehensive analysis across 83 software engineering benchmarks, finding contamination across nearly all tested models. This systematic contamination undermines reported performance improvements in domains where sophisticated LLM workflows are increasingly deployed.

\subsection{Evaluation Methodology in ML Systems}

The establishment of rigorous evaluation protocols emerged as the ML community's primary defense against overfitting. The foundational train/validation/test split methodology became standard practice, with cross-validation techniques providing robust model selection frameworks \citep{stone1974cross, kohavi1995study}. These approaches ensured that model selection decisions were made independently of final performance evaluation, preventing the test set from becoming an implicit part of the training process \citep{hastie2009elements, bishop2006pattern}.

Modern ML systems have extended these principles to encompass continuous evaluation and monitoring throughout the system lifecycle \citep{chandrasekaran2023test}. However, the fundamental principle remains unchanged: evaluation data must remain independent of optimization processes to ensure valid performance estimates.

\section{Detailed Case Studies: Workflow Overfitting in Practice}

\subsection{ReAct: Reasoning and Acting Workflows}

The ReAct framework \citep{yao2022react} exemplifies how sophisticated LLM workflows can be systematically overfitted to evaluation benchmarks. ReAct combines reasoning traces with action-taking capabilities, enabling LLMs to interact with external tools while maintaining step-by-step reasoning.

\textbf{Development Process Analysis}: The ReAct paper reports evaluation on four benchmarks: HotpotQA for question answering, FEVER for fact verification, ALFWorld for text-based games, and WebShop for web navigation. The development methodology involved iterative prompt engineering and workflow refinement specifically targeting these benchmarks.

\textbf{Overfitting Evidence}: The ReAct development process exhibits several characteristics indicative of benchmark optimization:
\begin{itemize}
\item Systematic testing of different prompting strategies against the target benchmarks
\item Iterative refinement of the reasoning-action interleaving based on benchmark performance
\item Optimization of tool integration strategies for benchmark-specific requirements
\item Tuning of termination criteria based on evaluation results
\end{itemize}

\textbf{Performance Claims}: The paper reports substantial improvements over baselines: ReAct achieves 27.4\% success on HotpotQA compared to 20.6\% for chain-of-thought prompting, representing a 33\% relative improvement \citep{yao2022react}. However, these gains emerged through extensive optimization against these specific evaluation sets.

\subsection{Code Generation Workflows}

Code generation tasks present particularly clear examples of workflow overfitting. Benchmarks like HumanEval \citep{chen2021evaluating} and MBPP \citep{austin2021program} have become central targets for LLM development, with teams iteratively refining their approaches against these specific problems.

\textbf{Development Process Analysis}: Modern coding workflows undergo extensive optimization cycles:
\begin{itemize}
\item Tool integration strategies refined based on benchmark performance
\item Error handling approaches optimized for common benchmark failure modes
\item Multi-step reasoning patterns tuned to handle benchmark-specific problem structures
\item Code execution and debugging workflows designed around benchmark evaluation criteria
\end{itemize}

\textbf{Systematic Contamination Evidence}: The LessLeak-Bench study \citep{zhou2025lessleak} provides compelling evidence of widespread contamination across software engineering benchmarks. The study analyzed 83 benchmarks and found average leakage ratios of 4.8\%, 2.8\%, and 0.7\% for Python, Java, and C/C++ benchmarks respectively. However, specific benchmarks showed much higher contamination rates, with QuixBugs exhibiting 100\% leakage and BigCloneBench showing 55.7\% leakage.
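
To read these figures, it helps to fix what a leakage ratio measures. We assume here, as an interpretation rather than a quotation of the study's definition, that it is the fraction of a benchmark's samples for which a matching sample was found in the pre-training data. In notation introduced only for this illustration, with $B$ the set of benchmark samples and $L \subseteq B$ the leaked ones,
\[
  r = \frac{|L|}{|B|} .
\]
Under this reading, a ratio of 100\% means that every sample of the benchmark was found, so no part of it can serve as unseen test data, while a low average across benchmarks can still hide individual benchmarks that are almost entirely exposed.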

\textbf{Performance Impact}: The study demonstrates that data leakage has substantial impact on LLM evaluation, with contaminated models showing inflated performance that does not generalize to truly novel programming challenges.

\subsection{Autonomous Agent Systems}

Autonomous agent frameworks like AutoGPT represent complex multi-component systems that are particularly susceptible to overfitting due to their iterative development processes and community-driven optimization.

\textbf{Development Challenges}: Agent systems face unique evaluation challenges:
\begin{itemize}
\item Benchmarks developed concurrently with the systems they evaluate
\item Community-driven optimization leading to distributed overfitting effects
\item Performance claims based on evaluation sets that guided development decisions
\item Informal evaluation criteria that evolve based on system capabilities
\end{itemize}

\textbf{Generalization Gaps}: Despite reported success on development benchmarks, deployed agent systems frequently exhibit significant performance degradation when encountering novel scenarios that differ from their optimization targets. This pattern suggests that benchmark performance may reflect sophisticated pattern matching rather than genuine autonomous reasoning capabilities.

\section{Expanded Analysis: Scientific Applications and Domain-Specific Risks}

\subsection{Chemistry and Materials Science}

Scientific applications of LLM workflows present particularly high stakes for the overfitting problem. Recent work in chemistry has developed sophisticated benchmarks like ChemBench, containing over 2,700 question-answer pairs designed to evaluate chemical knowledge and reasoning. While these benchmarks represent important advances in evaluation methodology, they also create new targets for optimization.

\textbf{Domain-Specific Overfitting Risks}: Chemical reasoning workflows often involve:
\begin{itemize}
\item Multi-step synthesis planning optimized against known reaction databases
\item Property prediction systems trained and validated on established chemical datasets
\item Literature analysis tools refined using benchmark chemical papers and abstracts
\item Safety assessment frameworks tuned against standardized hazard classification systems
\end{itemize}

\textbf{Reproducibility Standards in Chemistry}: The chemistry community has established rigorous standards for experimental reproducibility, but these standards have not yet been adapted for AI-assisted chemical research. Unlike traditional chemistry experiments, where failed replications are clearly identifiable, overfitted AI systems may produce plausible but incorrect results that are difficult to detect without expert domain knowledge.

\subsection{Physics and Engineering Applications}

Physics applications face similar challenges, with benchmarks targeting graduate-level physics problems. Systems optimized against such benchmarks may develop sophisticated pattern-matching capabilities that fail when confronted with truly novel problems.

\textbf{Multi-Domain Scientific Reasoning}: Engineering and other scientific applications often require reasoning across multiple domains simultaneously. Multi-faceted benchmark tasks create additional opportunities for overfitting, and a workflow tuned on them may fail on truly novel interdisciplinary problems.

\subsection{Biomedical and Healthcare Applications}

Healthcare applications represent the highest-stakes domain for AI system reliability. Biomedical LLM workflows increasingly target specialized benchmarks for clinical reasoning, drug discovery, and diagnostic assistance.

\textbf{Critical Safety Implications}: In healthcare contexts, overfitted performance can have severe consequences:
\begin{itemize}
\item Diagnostic systems optimized against medical benchmark datasets may miss novel disease presentations
\item Drug discovery workflows trained on established chemical databases may fail to identify truly innovative therapeutic approaches
\item Clinical decision support systems refined using benchmark cases may provide inappropriate recommendations for edge cases
\end{itemize}

\textbf{Regulatory and Compliance Challenges}: Healthcare AI systems must meet strict regulatory standards for safety and efficacy. However, current evaluation practices may not adequately distinguish between genuine clinical reasoning and sophisticated pattern matching against medical benchmarks.

\section{Implementation Roadmap}

The machine learning community has begun recognizing reproducibility challenges, with major conferences implementing evaluation standards \citep{pineau2021improving}. However, existing initiatives primarily focus on computational reproducibility rather than the workflow overfitting challenges we identify.

\subsection{Required Policy Extensions}

Building on existing frameworks, conferences and journals should require:
\begin{itemize}
\item \textbf{Development History Documentation}: Mandatory reporting of all datasets accessed during workflow development
\item \textbf{Benchmark Interaction Disclosure}: Documentation of optimization iterations performed against evaluation benchmarks  
\item \textbf{Independent Evaluation}: Results on truly held-out datasets not accessed during development
\item \textbf{Contamination Auditing}: Analysis of potential overlap between development resources and evaluation data
\end{itemize}

\subsection{Implementation Challenges and Solutions}

Implementation will face resistance due to increased development costs, competitive pressures, and the ``publish or perish'' academic culture. Success requires:
\begin{itemize}
\item \textbf{Incentive Alignment}: Recognition programs for methodologically rigorous research and funding priorities that emphasize evaluation methodology
\item \textbf{Community Infrastructure}: Third-party evaluation services maintaining truly independent benchmarks
\item \textbf{Gradual Adoption}: Phased implementation starting with pilot programs at select venues, expanding to full adoption over 18--36 months
\end{itemize}

The goal is ensuring that reported advances represent genuine progress in LLM capabilities rather than sophisticated optimization against known benchmarks.

\section{Comprehensive Framework for Addressing Workflow Overfitting}

Building on established ML evaluation best practices \citep{chandrasekaran2023test}, we propose a comprehensive approach to address workflow overfitting that extends beyond traditional model evaluation to encompass the entire development ecosystem.

\subsection{Data Separation Protocols}

\textbf{Temporal Isolation}: Establish evaluation benchmarks using data that postdates the workflow development period. This rules out direct optimization against the evaluation data and limits indirect optimization through community knowledge transfer.
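
Stated as a selection rule, in notation that is ours and only illustrative: let $t(x)$ be the date on which item $x$ first became available, $t_{\mathrm{dev}}$ the date on which workflow development ended, and $t_{\mathrm{cut}}$ the training-data cutoff of the underlying model. Including $t_{\mathrm{cut}}$ is our assumption; it extends the rule above to the pre-training contamination discussed earlier. The temporally isolated evaluation set drawn from a candidate pool $C$ is then
\[
  E = \bigl\{\, x \in C : t(x) > \max(t_{\mathrm{dev}},\, t_{\mathrm{cut}}) \,\bigr\} .
\]
The rule is only as reliable as the timestamps: an item published after both dates but copied from older material satisfies it in form without being new.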

\textbf{Domain Isolation}: Develop workflows in one domain and evaluate on structurally similar but content-distinct domains. For example, develop reasoning workflows on history datasets and evaluate on scientific reasoning tasks with similar logical structures but different content areas.

\textbf{Multi-Level Holdouts}: Implement nested holdout strategies where:
\begin{itemize}
\item Level 1: Component-level evaluation using standard ML holdout practices
\item Level 2: Workflow-level evaluation on integration benchmarks never used during development
\item Level 3: System-level evaluation on deployment scenarios that differ systematically from development contexts
\end{itemize}
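
One quantity that these nested holdouts make reportable is the gap between development and held-out performance. As a sketch in our own notation, let $a_{\mathrm{dev}}$ be the accuracy on a benchmark used during development ($n_{\mathrm{dev}}$ items) and $a_{\mathrm{hold}}$ the accuracy on a holdout never consulted during development ($n_{\mathrm{hold}}$ items). Treating the two as independent samples, the gap and its approximate standard error are
\[
  \Delta = a_{\mathrm{dev}} - a_{\mathrm{hold}}, \qquad
  \mathrm{SE}(\Delta) \approx \sqrt{\frac{a_{\mathrm{dev}}(1 - a_{\mathrm{dev}})}{n_{\mathrm{dev}}} + \frac{a_{\mathrm{hold}}(1 - a_{\mathrm{hold}})}{n_{\mathrm{hold}}}} .
\]
With the hypothetical values $a_{\mathrm{dev}} = 0.80$, $a_{\mathrm{hold}} = 0.70$, and $n_{\mathrm{dev}} = n_{\mathrm{hold}} = 500$, this gives $\Delta = 0.10$ and $\mathrm{SE}(\Delta) \approx 0.027$, so the gap is about 3.7 standard errors and unlikely to be sampling noise alone. A gap of this kind does not separate overfitting from ordinary distribution shift between the two sets, so it is a warning sign rather than a diagnosis.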

\subsection{Methodology Transparency Requirements}

The opacity of current LLM workflow development obscures critical evaluation dependencies. Unlike traditional ML where training procedures are explicitly documented, workflow optimization often occurs through informal iteration cycles that leave no systematic record. This lack of transparency prevents both contamination detection and independent replication of results.

\textbf{Development History Documentation}: Comprehensive logging systems must capture all evaluation data interactions during development. This includes datasets accessed during prompt engineering phases, the number of optimization iterations performed against each benchmark, specific optimization targets that guided architectural decisions, and community knowledge sources consulted. Such documentation enables post-hoc contamination analysis and supports reproducibility audits.

\textbf{Contamination Auditing}: Systematic contamination detection requires automated analysis of potential overlap between development resources and evaluation sets. This involves computational analysis of training data intersection with evaluation benchmarks, documentation of published work that influenced design decisions, and assessment of community knowledge transfer through forums, papers, and code repositories. Contamination auditing tools should be integrated into standard development workflows.

\textbf{Preregistration Protocols}: Adapting preregistration practices from experimental psychology and medical research can prevent post-hoc optimization bias. Teams must declare evaluation benchmarks before beginning development, specify success criteria and evaluation protocols in advance, and commit to reporting results regardless of outcome. This approach transforms evaluation from an iterative optimization process into a genuine hypothesis test.

\subsection{Community Infrastructure}

\textbf{Benchmark Lifecycle Management}: Establish systematic protocols for benchmark retirement and replacement:
\begin{itemize}
\item Monitor community optimization pressure on specific benchmarks
\item Implement automatic retirement triggers based on performance saturation
\item Develop systematic approaches for creating replacement benchmarks that maintain evaluation validity
\end{itemize}

\textbf{Shared Holdout Resources}: Create community-managed evaluation infrastructure:
\begin{itemize}
\item Third-party evaluation services that maintain truly independent benchmarks
\item Collaborative evaluation protocols that prevent individual teams from accessing evaluation data during development
\item Standardized reporting formats that enable meta-analysis across different workflow approaches
\end{itemize}

\textbf{Incentive Alignment}: Restructure research incentives to reward methodological rigor:
\begin{itemize}
\item Conference and journal policies that require methodology transparency
\item Recognition for negative results and replication studies
\item Funding priorities that emphasize evaluation methodology alongside technical innovation
\end{itemize}

\section{Why This Matters for Scientific Applications}

The overfitting problem is particularly concerning for scientific applications. Scientific research demands reproducible results, but if LLM workflows are optimized against the same benchmarks on which they are evaluated, reported performance may not generalize to real scientific problems.

Overfitted workflows may perform well on known benchmarks but fail on novel scientific challenges, leading to misplaced confidence in AI capabilities for scientific discovery. If reported performance is inflated due to overfitting, research funding and effort may be misdirected toward approaches that do not actually advance scientific capability.

The stakes are particularly high in scientific domains where accuracy and reliability are paramount, and where the cost of false confidence can impede genuine scientific progress.

\section{Discussion and Implications}

Unlike traditional ML overfitting, which typically affected single models, LLM workflow overfitting can contaminate entire research directions. When the community collectively optimizes against the same benchmarks, the cumulative effect creates systematic bias across the field.

The current practice conflates innovation (developing better workflows) with evaluation (measuring progress). True evaluation requires independence from the development process. As LLM workflows become more sophisticated and are deployed in critical applications, the cost of overfitted performance becomes higher.

The scientific community cannot afford to repeat ML's early mistakes, especially when the applications involve high-stakes scientific discovery and decision-making.

\section{Conclusion}

The LLM research community stands at a critical juncture. The sophistication of modern workflows has obscured a fundamental methodological problem: we are systematically reporting results on data that has been used for optimization. This mirrors the overfitting crisis that plagued early machine learning research.

Unlike traditional ML, where overfitting affected individual models, LLM workflow overfitting can contaminate entire research directions and mislead the community about actual capabilities. This is particularly concerning for scientific applications, where accuracy and reliability are paramount.

The solution requires collective action: establishing strict data separation protocols, demanding transparency in development processes, and creating truly independent evaluation resources. The ML community learned these lessons decades ago. The LLM community must learn them now, before the credibility of AI research is further undermined.

We call on the research community to adopt rigorous evaluation standards that separate development from testing, just as was necessary in traditional ML. Only through such methodological rigor can we ensure that reported advances represent genuine progress rather than sophisticated forms of overfitting.

\bibliographystyle{plainnat}
\bibliography{overfitting_llm}

\newpage

\section*{Agents4Science AI Involvement Checklist}

\begin{enumerate}
    \item \textbf{Hypothesis development}: The research topic and the hypothesis that LLM workflows suffer from systematic overfitting similar to early ML practices were jointly developed, with the AI agent providing the detailed conceptual framework and historical parallels.

    Answer: \textbf{[B] Mostly human, assisted by AI}
    
    Explanation: The human co-author identified the core problem and made the connection to historical ML overfitting. The AI agent developed the detailed hypothesis, literature connections, and theoretical framework for understanding the problem.

    \item \textbf{Experimental design and implementation}: This is a position paper that does not include empirical experiments. The analysis relies on synthesizing existing literature and making methodological arguments.

    Answer: \textbf{[NA] Not Applicable}
    
    Explanation: No experiments were conducted. The paper makes its argument through literature analysis, theoretical reasoning, and methodological critique of current practices.

    \item \textbf{Analysis of data and interpretation of results}: The paper analyzes existing literature and research practices rather than experimental data. The AI agent conducted literature search and synthesis.

    Answer: \textbf{[C] Mostly AI, assisted by human}
    
    Explanation: The AI agent conducted web searches for relevant literature, identified key papers on data contamination and benchmark leakage, and synthesized findings to support the core arguments. The human co-author provided guidance on focus areas.

    \item \textbf{Writing}: The AI agent was responsible for the majority of the writing, including structure, arguments, and academic formatting.

    Answer: \textbf{[C] Mostly AI, assisted by human}
    
    Explanation: The AI agent wrote the complete paper including abstract, introduction, literature review, arguments, and conclusions. The human co-author provided feedback, guidance on emphasis areas, and editorial input.

    \item \textbf{Observed AI Limitations}: The AI agent required guidance on specific examples and emphasis areas, and needed human oversight to ensure arguments remained grounded and credible.

    Description: The AI agent occasionally needed direction on which aspects of the overfitting problem to emphasize most strongly. The human co-author's role was crucial in keeping the argument focused and ensuring it addressed the most important methodological concerns in the field.
\end{enumerate}

\newpage

\section*{Agents4Science Paper Checklist}

\begin{enumerate}

\item \textbf{Claims}
    \item[] Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
    \item[] Answer: \textbf{[Yes]}
    \item[] Justification: The abstract and introduction clearly state this is a position paper arguing that LLM workflows suffer from systematic overfitting, with claims appropriately scoped to methodological critique rather than empirical findings.

\item \textbf{Limitations}
    \item[] Question: Does the paper discuss the limitations of the work performed by the authors?
    \item[] Answer: \textbf{[Yes]}
    \item[] Justification: Section 7 discusses the scope of the argument and acknowledges this is a methodological position paper without empirical validation of proposed solutions.

\item \textbf{Theory assumptions and proofs}
    \item[] Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
    \item[] Answer: \textbf{[NA]}
    \item[] Justification: This is a position paper that makes methodological arguments rather than theoretical claims requiring formal proofs.

\item \textbf{Experimental result reproducibility}
    \item[] Question: Does the paper fully disclose all the information needed to reproduce the main experimental results?
    \item[] Answer: \textbf{[NA]}
    \item[] Justification: This paper does not include experiments. It is a methodological position paper based on literature analysis and reasoned argument.

\item \textbf{Open access to data and code}
    \item[] Question: Does the paper provide open access to the data and code?
    \item[] Answer: \textbf{[NA]}
    \item[] Justification: No experiments or code are involved in this position paper.

\item \textbf{Experimental setting/details}
    \item[] Question: Does the paper specify all training and test details necessary to understand the results?
    \item[] Answer: \textbf{[NA]}
    \item[] Justification: This paper does not include experiments.

\item \textbf{Experiment statistical significance}
    \item[] Question: Does the paper report error bars or statistical significance information?
    \item[] Answer: \textbf{[NA]}
    \item[] Justification: No experiments are conducted in this position paper.

\item \textbf{Experiments compute resources}
    \item[] Question: Does the paper provide sufficient information on computer resources needed?
    \item[] Answer: \textbf{[NA]}
    \item[] Justification: No experiments requiring computational resources were conducted.
    
\item \textbf{Code of ethics}
    \item[] Question: Does the research conform with the Agents4Science Code of Ethics?
    \item[] Answer: \textbf{[Yes]}
    \item[] Justification: This methodological critique aims to improve research practices and scientific integrity, fully conforming with ethical research standards.

\item \textbf{Broader impacts}
    \item[] Question: Does the paper discuss potential positive and negative societal impacts?
    \item[] Answer: \textbf{[Yes]}
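    \item[] Justification: The sections ``Expanded Analysis: Scientific Applications and Domain-Specific Risks'' and ``Why This Matters for Scientific Applications'' discuss implications for scientific applications and the risks of methodological failures in high-stakes research domains.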

\end{enumerate}

\end{document}