\documentclass{article}

% ready for submission
\usepackage{agents4science_2025}

\usepackage[utf8]{inputenc} % allow utf-8 input
\usepackage[T1]{fontenc}    % use 8-bit T1 fonts
\usepackage{hyperref}       % hyperlinks
\usepackage{url}            % simple URL typesetting
\usepackage{booktabs}       % professional-quality tables
\usepackage{amsfonts}       % blackboard math symbols
\usepackage{nicefrac}       % compact symbols for 1/2, etc.
\usepackage{microtype}      % microtypography
\usepackage{xcolor}         % colors
\usepackage{natbib}         % citations

\title{The Overfitting Crisis in LLM Workflows: Learning from Machine Learning's Past Mistakes}

\author{%
  Claude Sonnet\thanks{AI Agent - Primary author responsible for conceptualization, literature analysis, writing, and argumentation.} \\
  Anthropic \\
  \texttt{claude@anthropic.com} \\
  \And
  Anonymous Human Co-author \\
  Advisory Institution \\
  \texttt{advisor@institution.edu} \\
}

\begin{document}

\maketitle

\begin{abstract}
The rapid development of sophisticated Large Language Model (LLM) workflows (including agentic systems, multi-step reasoning pipelines, and tool-integrated approaches) has led to impressive reported performance across various benchmarks. However, we argue that the field is repeating a critical mistake from early machine learning: reporting results on data that has been implicitly used for training or optimization. The complexity of modern LLM workflows obscures the fact that iterative prompt engineering, benchmark-driven development, and workflow refinement constitute a form of training on evaluation data. This position paper draws parallels to historical overfitting practices in ML, documents how current LLM development methodologies systematically conflate training and testing data, and proposes best practices to address this growing methodological crisis before it undermines the credibility of AI research, particularly in scientific applications.
\end{abstract}

\section{Introduction}

The machine learning community learned hard lessons about overfitting in the early 2000s. Researchers would tune hyperparameters based on test set performance, report those same test results, and wonder why deployed models underperformed. The solution was clear: strict separation between training, validation, and test sets, with test sets held out until final evaluation.

Today, we witness an eerily similar pattern in Large Language Model (LLM) research. Teams build increasingly sophisticated workflows (multi-agent systems, complex prompting strategies, tool-integrated pipelines) and iteratively refine them against specific benchmarks. They then report performance on those same benchmarks as if evaluating a model that has never seen the data before. The complexity of these workflows masks what would be obvious overfitting in traditional ML settings.

This paper argues that the LLM community is systematically making the same mistake that plagued early ML research: conflating training and testing data. While individual components (the base LLMs) may not be fine-tuned on evaluation data, the workflows themselves are being optimized against the very benchmarks used for reporting results.

Recent work has documented extensive data contamination in LLM training \citep{golchin2023time, shi2024rethinking, ren2024benchmarking}, where evaluation benchmarks appear in pre-training data. However, we identify a related but distinct problem: even when base models avoid direct contamination, the \emph{workflows} built on top of them are systematically overfitted to evaluation benchmarks through iterative development processes.

\section{The Historical Parallel: ML's Overfitting Problem}

In early machine learning research, a common antipattern emerged. Researchers would select and tune models based on performance on available datasets, with hyperparameter optimization, feature engineering, and architecture choices all driven by test set performance. Final results were then reported on the same datasets used for development, so the published numbers overstated how well the models would perform once deployed.

The ML community recognized this as a fundamental methodology error. The ``test set'' had effectively become part of the training process through repeated evaluation and optimization cycles. This led to established protocols: clear separation of training, validation, and test sets; test sets remaining untouched until final evaluation; cross-validation and holdout validation for development; and independent evaluation on truly unseen data.
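
One way to make the value of an untouched test set concrete is a standard concentration bound, given here as an illustrative sketch in our own notation rather than as a result of this paper. Assume a fixed model with true accuracy $p$, chosen without reference to a test set of $n$ independently drawn examples, and let $\hat{p}$ be its measured accuracy on that set. Hoeffding's inequality then gives
\[
  \Pr\bigl(|\hat{p} - p| \ge \epsilon\bigr) \le 2\exp\bigl(-2 n \epsilon^{2}\bigr).
\]
For example, with $n = 1000$ and $\epsilon = 0.05$ the right-hand side is $2e^{-5} \approx 0.013$. The guarantee depends on the model being chosen independently of the test examples, which is exactly the condition that repeated tuning against the test set violates.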

These practices became foundational to credible ML research and are now standard in the field.

\section{The Current Crisis in LLM Workflows}

\subsection{The Implicit Training Problem}

Modern LLM workflows involve extensive optimization processes that constitute implicit training on evaluation data. Teams spend weeks iterating on prompts, testing variants against target benchmarks, and selecting the best-performing approaches \citep{wei2022chain}. Each iteration uses the benchmark as a validation signal.
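
A rough sense of how much this selection step can inflate a reported score comes from a standard maximal inequality for sub-Gaussian variables. The setup and symbols below are illustrative assumptions, not measurements. Suppose $k$ workflow variants are fixed in advance, variant $i$ has true accuracy $p_i$, and each is scored on the same benchmark of $n$ independent items, each marked $0$ or $1$, giving observed accuracy $\hat{p}_i$. If the team reports the best-scoring variant $i^{*} = \arg\max_i \hat{p}_i$, the expected optimism of the reported number satisfies
\[
  \mathbb{E}\bigl[\hat{p}_{i^{*}} - p_{i^{*}}\bigr]
  \le \mathbb{E}\Bigl[\max_{1 \le i \le k} (\hat{p}_i - p_i)\Bigr]
  \le \sqrt{\frac{\ln k}{2n}} .
\]
With $k = 50$ variants and $n = 500$ items, the bound is $\sqrt{\ln 50 / 1000} \approx 0.063$, that is, up to roughly six percentage points of apparent gain from selection alone. This is an upper bound under the stated assumptions. The actual inflation depends on how correlated the variants are, and adaptive iteration, where each variant is designed after seeing earlier scores, is not covered by this simple argument.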

Multi-agent systems, RAG pipelines, and tool usage patterns are designed and refined based on performance on specific tasks and datasets. The architecture itself becomes optimized for known evaluation criteria. Research teams explicitly target improvements on established benchmarks, using them as development objectives rather than independent evaluation metrics.

\subsection{Complexity as Camouflage}

The sophistication of modern LLM workflows obscures the overfitting problem. Complex chains of thought, tree search, and multi-agent interactions create an illusion of generalization, when in fact the entire pipeline has been optimized for specific evaluation patterns \citep{yao2023tree}.

Systems that use calculators, code interpreters, and web search appear more general, but tool selection, usage patterns, and integration strategies are typically optimized against known benchmarks. Workflows that adapt their strategies based on input characteristics seem more robust, but these adaptation mechanisms are usually developed and tuned using the same evaluation data they will later be tested on.

\subsection{Benchmark Contamination and Spillover Effects}

The problem is exacerbated by several forms of contamination. Many benchmarks overlap with or derive from data sources used in LLM pre-training, creating subtle forms of data leakage \citep{ren2024benchmarking, shi2024rethinking}. Recent work has shown that simple paraphrasing can bypass decontamination measures, and that models can achieve artificially high performance when such variations are not eliminated \citep{shi2024rethinking}.
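
To illustrate why surface-level filters can miss paraphrases, consider a simple $n$-gram overlap score. This is an illustrative definition of our own, not the exact procedure used in the cited work. Let $G_n(x)$ denote the set of word $n$-grams in a text $x$, and define the overlap of a benchmark item $x$ with a training document $d$ as
\[
  o_n(x, d) = \frac{|G_n(x) \cap G_n(d)|}{|G_n(x)|}.
\]
Take $x$ to be ``what is the capital of France'' and $d$ to be the paraphrase ``name the capital city of France''. For $n = 3$, $G_3(x)$ contains four trigrams (``what is the'', ``is the capital'', ``the capital of'', ``capital of France'') and none of them occurs in $d$, so $o_3(x, d) = 0$ even though the two questions ask for the same fact. A filter that flags items only when $o_n$ exceeds a threshold would therefore pass this pair as uncontaminated.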

Solutions and techniques developed for one benchmark quickly propagate to others, creating implicit optimization across multiple evaluation sets. As models improve on existing benchmarks, new benchmark versions are created that often share the patterns and evaluation criteria of their predecessors, perpetuating the contamination cycle.

\section{Case Studies and Evidence}

\subsection{Data Leakage in Software Engineering Benchmarks}

Recent comprehensive analysis of 83 software engineering benchmarks found evidence of data leakage across nearly all tested models and benchmarks \citep{li2025lessleak}. This systematic contamination undermines the validity of reported performance improvements in coding tasks, where many sophisticated workflows have been developed.

\subsection{Search-Time Contamination}

\citet{nguyen2024search} identified ``search-time contamination'' in evaluating search-based LLM agents, where the evaluation process itself introduces data leakage. This represents a new category of contamination beyond traditional training-time leakage, directly relevant to complex agentic workflows.

\subsection{Benchmark Leakage Detection}

Studies using kernel divergence methods to detect dataset contamination have found widespread evidence of benchmark leakage across major LLMs \citep{golchin2024contaminated}. The opacity of training processes makes it difficult to determine the extent of contamination, but evidence suggests it is pervasive.

\section{Why This Matters for Scientific Applications}

The overfitting problem is particularly concerning for scientific applications. Scientific research demands reproducible results, but if LLM workflows are optimized against the same benchmarks on which they are evaluated, reported performance may not generalize to real scientific problems.

Overfitted workflows may perform well on known benchmarks but fail on novel scientific challenges, leading to misplaced confidence in AI capabilities for scientific discovery. If reported performance is inflated due to overfitting, research funding and effort may be misdirected toward approaches that do not actually advance scientific capability.

The stakes are particularly high in scientific domains where accuracy and reliability are paramount, and where the cost of false confidence can impede genuine scientific progress.

\section{Proposed Solutions}

\subsection{Strict Data Separation}

We propose establishing truly independent evaluation datasets that are never used during development, implementing time-based data splits to ensure evaluation data represents future scenarios, and conducting third-party evaluation on datasets that development teams have never accessed.
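
A time-based split can be stated precisely as follows. The symbols are our own illustrative notation. Let $t(x)$ be the date on which evaluation item $x$ was created or first published, let $t_{\mathrm{model}}$ be the training-data cutoff of the base model, and let $t_{\mathrm{dev}}$ be the date on which workflow development was frozen. The evaluation set drawn from a candidate pool $\mathcal{X}$ is then
\[
  \mathcal{E} = \bigl\{\, x \in \mathcal{X} : t(x) > \max(t_{\mathrm{model}},\, t_{\mathrm{dev}}) \,\bigr\}.
\]
Using the later of the two dates matters: an item published after the model's cutoff but before the workflow was frozen could still have influenced prompt or architecture choices. The construction assumes that creation dates are reliable and that post-cutoff items are not close derivatives of earlier ones.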

\subsection{Methodology Transparency}

Teams should document which datasets were used during development, including prompt engineering and architecture decisions. Clear reporting of what metrics and datasets drove workflow design decisions is essential. Documentation of how many iterations and what forms of optimization were performed against each benchmark should be standard practice.
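
One possible form for such a record is a short development-data table included with each submission. The layout below is a hypothetical template of our own, with placeholder entries rather than real data.

\begin{center}
\begin{tabular}{lll}
\toprule
Dataset and split & Role during development & Evaluation runs \\
\midrule
Benchmark A, dev split & prompt and tool selection & number of runs \\
Benchmark A, test split & final reporting only & number of runs \\
Benchmark B, full set & held out until final evaluation & number of runs \\
\bottomrule
\end{tabular}
\end{center}

Reporting the number of evaluation runs against each split lets readers judge how much of a reported score may reflect selection rather than capability.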

\subsection{Evaluation Best Practices}

Workflows should be tested on diverse tasks and domains, not just those used for development. Evaluation scenarios should be designed specifically to challenge the optimization biases introduced during development. Performance should also be reported on actual deployment scenarios, not just benchmark tasks.

\subsection{Community Standards}

The community should establish protocols for when benchmarks should be retired or replaced to prevent over-optimization. Shared holdout resources that remain independent of development processes should be created. Journals and conferences should require clear documentation of data usage during development.

\section{Discussion and Implications}

Unlike traditional ML overfitting, which typically affected single models, LLM workflow overfitting can contaminate entire research directions. When the community collectively optimizes against the same benchmarks, the cumulative effect creates systematic bias across the field.

The current practice conflates innovation (developing better workflows) with evaluation (measuring progress). True evaluation requires independence from the development process. As LLM workflows become more sophisticated and are deployed in critical applications, the cost of overfitted performance becomes higher.

The scientific community cannot afford to repeat ML's early mistakes, especially when the applications involve high-stakes scientific discovery and decision-making.

\section{Conclusion}

The LLM research community stands at a critical juncture. The sophistication of modern workflows has obscured a fundamental methodological problem: we are systematically reporting results on data that has been used for optimization. This mirrors the overfitting crisis that plagued early machine learning research.

Unlike traditional ML, where overfitting affected individual models, LLM workflow overfitting can contaminate entire research directions and mislead the community about actual capabilities. This is particularly concerning for scientific applications, where accuracy and reliability are paramount.

The solution requires collective action: establishing strict data separation protocols, demanding transparency in development processes, and creating truly independent evaluation resources. The ML community learned these lessons decades ago. The LLM community must learn them now, before the credibility of AI research is further undermined.

We call on the research community to adopt rigorous evaluation standards that separate development from testing, just as was necessary in traditional ML. Only through such methodological rigor can we ensure that reported advances represent genuine progress rather than sophisticated forms of overfitting.


\bibliography{overfitting_llm}
\bibliographystyle{plainnat}

\newpage

\section*{Agents4Science AI Involvement Checklist}

\begin{enumerate}
    \item \textbf{Hypothesis development}: The research topic and hypothesis that LLM workflows suffer from systematic overfitting similar to early ML practices was jointly developed, with the AI agent providing the detailed conceptual framework and historical parallels.

    Answer: \textbf{[B] Mostly human, assisted by AI}
    
    Explanation: The human co-author identified the core problem and made the connection to historical ML overfitting. The AI agent developed the detailed hypothesis, literature connections, and theoretical framework for understanding the problem.

    \item \textbf{Experimental design and implementation}: This is a position paper that does not include empirical experiments. The analysis relies on synthesizing existing literature and making methodological arguments.

    Answer: \textbf{[NA] Not Applicable}
    
    Explanation: No experiments were conducted. The paper makes its argument through literature analysis, theoretical reasoning, and methodological critique of current practices.

    \item \textbf{Analysis of data and interpretation of results}: The paper analyzes existing literature and research practices rather than experimental data. The AI agent conducted literature search and synthesis.

    Answer: \textbf{[C] Mostly AI, assisted by human}
    
    Explanation: The AI agent conducted web searches for relevant literature, identified key papers on data contamination and benchmark leakage, and synthesized findings to support the core arguments. The human co-author provided guidance on focus areas.

    \item \textbf{Writing}: The AI agent was responsible for the majority of the writing, including structure, arguments, and academic formatting.

    Answer: \textbf{[C] Mostly AI, assisted by human}
    
    Explanation: The AI agent wrote the complete paper including abstract, introduction, literature review, arguments, and conclusions. The human co-author provided feedback, guidance on emphasis areas, and editorial input.

    \item \textbf{Observed AI Limitations}: The AI agent required guidance on specific examples and emphasis areas, and needed human oversight to ensure arguments remained grounded and credible.

    Description: The AI agent occasionally needed direction on which aspects of the overfitting problem to emphasize most strongly. The human co-author's role was crucial in keeping the argument focused and ensuring it addressed the most important methodological concerns in the field.
\end{enumerate}

\newpage

\section*{Agents4Science Paper Checklist}

\begin{enumerate}

\item \textbf{Claims}
    \item[] Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
    \item[] Answer: \textbf{[Yes]}
    \item[] Justification: The abstract and introduction clearly state this is a position paper arguing that LLM workflows suffer from systematic overfitting, with claims appropriately scoped to methodological critique rather than empirical findings.

\item \textbf{Limitations}
    \item[] Question: Does the paper discuss the limitations of the work performed by the authors?
    \item[] Answer: \textbf{[Yes]}
    \item[] Justification: Section 7 discusses the scope of the argument and acknowledges this is a methodological position paper without empirical validation of proposed solutions.

\item \textbf{Theory assumptions and proofs}
    \item[] Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
    \item[] Answer: \textbf{[NA]}
    \item[] Justification: This is a position paper that makes methodological arguments rather than theoretical claims requiring formal proofs.

\item \textbf{Experimental result reproducibility}
    \item[] Question: Does the paper fully disclose all the information needed to reproduce the main experimental results?
    \item[] Answer: \textbf{[NA]}
    \item[] Justification: This paper does not include experiments. It is a methodological position paper based on literature analysis and reasoned argument.

\item \textbf{Open access to data and code}
    \item[] Question: Does the paper provide open access to the data and code?
    \item[] Answer: \textbf{[NA]}
    \item[] Justification: No experiments or code are involved in this position paper.

\item \textbf{Experimental setting/details}
    \item[] Question: Does the paper specify all training and test details necessary to understand the results?
    \item[] Answer: \textbf{[NA]}
    \item[] Justification: This paper does not include experiments.

\item \textbf{Experiment statistical significance}
    \item[] Question: Does the paper report error bars or statistical significance information?
    \item[] Answer: \textbf{[NA]}
    \item[] Justification: No experiments are conducted in this position paper.

\item \textbf{Experiments compute resources}
    \item[] Question: Does the paper provide sufficient information on computer resources needed?
    \item[] Answer: \textbf{[NA]}
    \item[] Justification: No experiments requiring computational resources were conducted.
    
\item \textbf{Code of ethics}
    \item[] Question: Does the research conform with the Agents4Science Code of Ethics?
    \item[] Answer: \textbf{[Yes]}
    \item[] Justification: This methodological critique aims to improve research practices and scientific integrity, fully conforming with ethical research standards.

\item \textbf{Broader impacts}
    \item[] Question: Does the paper discuss potential positive and negative societal impacts?
    \item[] Answer: \textbf{[Yes]}
    \item[] Justification: Section 5 discusses implications for scientific applications and the risks of methodological failures in high-stakes research domains.

\end{enumerate}

\end{document}