\documentclass{article}

% ready for submission
\usepackage{agents4science_2025}

\usepackage[utf8]{inputenc} % allow utf-8 input
\usepackage[T1]{fontenc}    % use 8-bit T1 fonts
\usepackage{hyperref}       % hyperlinks
\usepackage{url}            % simple URL typesetting
\usepackage{booktabs}       % professional-quality tables
\usepackage{amsfonts}       % blackboard math symbols
\usepackage{nicefrac}       % compact symbols for 1/2, etc.
\usepackage{microtype}      % microtypography
\usepackage{xcolor}         % colors
\usepackage{natbib}         % citations

\title{The Overfitting Crisis in LLM Workflows: Learning from Machine Learning's Past Mistakes}

\author{%
  Claude Sonnet\thanks{AI Agent - Primary author responsible for conceptualization, literature analysis, writing, and argumentation.} \\
  Anthropic \\
  \texttt{claude@anthropic.com} \\
  \And
  Anonymous Human Co-author \\
  Advisory Institution \\
  \texttt{advisor@institution.edu} \\
}

\begin{document}

\maketitle

\begin{abstract}
The rapid development of sophisticated Large Language Model (LLM) workflows—including agentic systems, multi-step reasoning pipelines, and tool-integrated approaches—has led to impressive reported performance across various benchmarks. However, we argue that the field is repeating a critical mistake from early machine learning: reporting results on data that has been implicitly used for training or optimization. The complexity of modern LLM workflows obscures the fact that iterative prompt engineering, benchmark-driven development, and workflow refinement constitute a form of training on evaluation data. This position paper draws parallels to historical overfitting practices in ML, documents how current LLM development methodologies systematically conflate training and testing data, and proposes best practices to address this growing methodological crisis before it undermines the credibility of AI research, particularly in scientific applications.
\end{abstract}

\section{Introduction}

The machine learning community learned hard lessons about overfitting in the early 2000s. Researchers would tune hyperparameters based on test set performance, report those same test results, and wonder why deployed models underperformed. The solution was clear: strict separation between training, validation, and test sets, with test sets remaining truly held-out until final evaluation.

Today, we witness an eerily similar pattern in Large Language Model (LLM) research. Teams build increasingly sophisticated workflows—multi-agent systems, complex prompting strategies, tool-integrated pipelines—and iteratively refine them against specific benchmarks. They then report performance on those same benchmarks as if evaluating a model that has never seen the data before. The complexity of these workflows masks what would be obvious overfitting in traditional ML settings.

This paper argues that the LLM community is systematically making the same mistake that plagued early ML research: conflating training and testing data. While individual components (the base LLMs) may not be fine-tuned on evaluation data, the workflows themselves are being optimized against the very benchmarks used for reporting results.

Recent work has documented extensive data contamination in LLM training \citep{golchin2023time, shi2024rethinking, ren2024benchmarking}, where evaluation benchmarks appear in pre-training data. However, we identify a related but distinct problem: even when base models avoid direct contamination, the \emph{workflows} built on top of them are systematically overfitted to evaluation benchmarks through iterative development processes.

\section{The Historical Parallel: ML's Overfitting Problem}

In early machine learning research, a common antipattern emerged. Researchers would select and tune models based on performance on available datasets, with hyperparameter optimization, feature engineering, and architecture choices driven by test set performance. Final results were then reported on the same datasets used for development, so performance often degraded significantly in real-world deployment.

The ML community recognized this as a fundamental methodological error. The ``test set'' had effectively become part of the training process through repeated evaluation and optimization cycles. This led to established protocols: clear separation of training, validation, and test sets; test sets remaining untouched until final evaluation; cross-validation and holdout validation for development; and independent evaluation on truly unseen data.

These practices became foundational to credible ML research and are now standard in the field.

\section{The Current Crisis in LLM Workflows}

\subsection{The Implicit Training Problem}

Modern LLM workflows involve extensive optimization processes that constitute implicit training on evaluation data. Teams spend weeks iterating on prompts, testing variants against target benchmarks, and selecting the best-performing approaches \citep{wei2022chain}. Each iteration uses the benchmark as a validation signal.
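
To make the validation-signal concern concrete, the following is a rough sketch rather than a measurement. Assume, hypothetically, that a team evaluates $k$ workflow variants on the same benchmark of $n$ independently scored items, and that all variants share the same true accuracy $p$. Each observed accuracy $\hat{p}_i$ is an average of $n$ values in $[0,1]$, so a standard sub-Gaussian maximal inequality bounds how far the best observed score can overstate $p$ in expectation:
\[
\mathbb{E}\Bigl[\max_{1 \le i \le k} \hat{p}_i\Bigr] - p \;\le\; \sqrt{\frac{\ln k}{2n}} .
\]
For illustrative values $k = 50$ and $n = 500$ (both assumptions, not figures from any cited study), the bound evaluates to $\sqrt{\ln 50 / 1000} \approx 0.063$, or roughly six percentage points. This is an upper limit, not an estimate of the actual inflation, but it indicates that selecting the best of many variants on a fixed benchmark can by itself produce apparent gains of several percentage points even when no variant is truly better.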

Multi-agent systems, RAG pipelines, and tool usage patterns are designed and refined based on performance on specific tasks and datasets. The architecture itself becomes optimized for known evaluation criteria. Research teams explicitly target improvements on established benchmarks, using them as development objectives rather than independent evaluation metrics.

\subsection{Complexity as Camouflage}

The sophistication of modern LLM workflows obscures the overfitting problem. Complex chains of thought, tree search, and multi-agent interactions create an illusion of generalization, when in fact the entire pipeline has been optimized for specific evaluation patterns \citep{yao2023tree}.

Systems that use calculators, code interpreters, and web search appear more general, but tool selection, usage patterns, and integration strategies are typically optimized against known benchmarks. Workflows that adapt their strategies based on input characteristics seem more robust, but these adaptation mechanisms are usually developed and tuned using the same evaluation data they will later be tested on.

\subsection{Benchmark Contamination and Spillover Effects}

The problem is exacerbated by several forms of contamination. Many benchmarks overlap with or derive from data sources used in LLM pre-training, creating subtle forms of data leakage \citep{ren2024benchmarking, shi2024rethinking}. Recent comprehensive analysis demonstrates that simple decontamination methods like n-gram matching are insufficient, as paraphrasing and translation can easily bypass these measures \citep{shi2024rethinking}.

Studies using kernel divergence methods have found widespread evidence of benchmark leakage across major LLMs \citep{golchin2024contaminated}. The LessLeak-Bench study \citep{li2025lessleak} provided particularly compelling evidence, analyzing 83 software engineering benchmarks and finding contamination across nearly all tested models.

Solutions and techniques developed for one benchmark quickly propagate to others, creating implicit optimization across multiple evaluation sets. As models improve on existing benchmarks, new versions are created that often share similar patterns and evaluation criteria, perpetuating the contamination cycle. Additionally, \citet{nguyen2024search} identified ``search-time contamination'' in evaluating search-based LLM agents, where the evaluation process itself introduces data leakage beyond traditional training-time contamination.

\section{Related Work}

\subsection{Historical Overfitting in Machine Learning}

The machine learning community has long recognized the dangers of overfitting to evaluation data. Early ML research suffered from researchers repeatedly testing models on the same evaluation sets, leading to inflated performance estimates that failed to generalize \citep{chandrasekaran2023test}. This led to the establishment of rigorous evaluation protocols including proper train/validation/test splits and holdout methodologies.

The development of comprehensive benchmarking frameworks helped standardize evaluation practices \citep{olson2017pmlb}. However, as noted by \citet{bouthillier2021recommendations}, benchmark datasets themselves can become overoptimized targets when the entire research community focuses on the same evaluation sets over extended periods.

\subsection{Data Contamination in Large Language Models}

Recent work has extensively documented data contamination issues in LLM training and evaluation. \citet{golchin2023time} and \citet{shi2024rethinking} demonstrate that simple decontamination methods like n-gram matching are insufficient, as paraphrasing and translation can easily bypass these measures. \citet{ren2024benchmarking} found evidence of benchmark leakage across major LLMs, with the opacity of training processes making contamination difficult to detect.

The LessLeak-Bench study \citep{li2025lessleak} provided comprehensive analysis across 83 software engineering benchmarks, finding contamination across nearly all tested models. This systematic contamination undermines reported performance improvements in domains where sophisticated LLM workflows are increasingly deployed.

\subsection{Evaluation Methodology in ML Systems}

Best practices for ML system evaluation emphasize the importance of rigorous testing throughout the system lifecycle \citep{chandrasekaran2023test}. Modern ML operations frameworks \citep{mlops2024crisp} stress the need for continuous evaluation and monitoring to prevent model staleness and performance degradation.

The emergence of specialized benchmarking tools and frameworks \citep{olson2017pmlb, mlsys2024benchmarking} reflects the community's recognition that proper evaluation requires systematic approaches rather than ad-hoc practices.

\section{Detailed Case Studies: Workflow Overfitting in Practice}

\subsection{ReAct: Reasoning and Acting Workflows}

The ReAct framework \citep{yao2022react} exemplifies how sophisticated LLM workflows can be systematically overfitted to evaluation benchmarks. ReAct combines reasoning traces with action-taking capabilities, enabling LLMs to interact with external tools while maintaining step-by-step reasoning.

\textbf{Development Process Analysis}: The ReAct paper reports evaluation on four benchmarks: HotpotQA for question answering, FEVER for fact verification, ALFWorld for text-based games, and WebShop for web navigation. The development methodology involved iterative prompt engineering and workflow refinement specifically targeting these benchmarks.

The authors systematically tested different prompting strategies, comparing ReAct against reasoning-only (Chain-of-Thought) and acting-only baselines. Each iteration used performance on these specific benchmarks to guide design decisions about when to reason versus when to act, how to structure the observation-action loops, and how to integrate external tool usage.

\textbf{Overfitting Indicators}: Several aspects of the ReAct development suggest implicit training on evaluation data:
\begin{itemize}
\item The framework was designed after analyzing failure modes on the target benchmarks
\item Prompt engineering involved multiple iterations tested against the same evaluation sets
\item Tool integration strategies were optimized based on benchmark-specific requirements
\item The decision criteria for when to terminate reasoning loops were tuned for benchmark performance
\end{itemize}

\subsection{Performance Timeline Analysis}

Examining the development timelines of major LLM workflows reveals concerning patterns that suggest systematic overfitting. In the ReAct case study, initial performance on HotpotQA showed modest improvements over baseline chain-of-thought approaches. However, as the framework underwent iterative refinement specifically targeting this benchmark, performance gaps widened dramatically.

\textbf{Quantitative Evidence}: The ReAct paper reports that their approach achieved 27.4\% success on HotpotQA compared to 20.6\% for chain-of-thought prompting. However, this 6.8 percentage point improvement came after extensive optimization against this specific benchmark, including multiple rounds of prompt engineering and workflow refinement.

Similarly, in scientific reasoning workflows, we observe performance improvements that correlate suspiciously with development time spent on specific benchmarks. Systems targeting GPQA (Graduate-Level Google-Proof Q\&A) in physics, chemistry, and biology \citep{openai2024learning} show rapid performance gains during development phases, followed by plateaus that suggest optimization saturation rather than fundamental capability improvements.

\textbf{Deployment Performance Gaps}: More concerning are the documented gaps between benchmark performance and real-world deployment scenarios. AutoGPT users frequently report that systems performing well on community benchmarks struggle with novel tasks that differ from development scenarios \citep{zhang2023autogpt}. This pattern mirrors classical overfitting in traditional ML, where models excel on training/validation data but fail on truly independent test cases.

\subsection{Code Analysis Workflows}

Code generation and software engineering tasks present particularly clear examples of workflow overfitting. Popular coding benchmarks like HumanEval and MBPP have become de facto development targets, with teams iteratively refining their approaches against these specific problems.

\textbf{Development Process Analysis}: Modern coding assistants undergo extensive optimization cycles where:
\begin{itemize}
\item Tool integration strategies are refined based on benchmark performance
\item Error handling approaches are optimized for common benchmark failure modes
\item Multi-step reasoning patterns are tuned to handle benchmark-specific problem structures
\item Code execution and debugging workflows are designed around benchmark evaluation criteria
\end{itemize}

The LessLeak-Bench study \citep{li2025lessleak} provides compelling evidence that these optimization practices have led to systematic contamination across 83 software engineering benchmarks. Nearly all tested models showed evidence of data leakage, suggesting that the entire development ecosystem has become implicitly optimized against these evaluation sets.

\textbf{Implications for Generalization}: When coding workflows optimized against specific benchmarks encounter novel programming challenges, performance often degrades significantly. This suggests that reported benchmark improvements may reflect sophisticated pattern matching rather than genuine programming capability advances.

\section{Expanded Analysis: Scientific Applications and Domain-Specific Risks}

\subsection{Chemistry and Materials Science}

Scientific applications of LLM workflows present particularly high stakes for the overfitting problem. Recent work in chemistry has developed sophisticated benchmarks like ChemBench \citep{chembench2024nature}, containing over 2,700 question-answer pairs designed to evaluate chemical knowledge and reasoning. While these benchmarks represent important advances in evaluation methodology, they also create new targets for optimization.

\textbf{Domain-Specific Overfitting Risks}: Chemical reasoning workflows often involve:
\begin{itemize}
\item Multi-step synthesis planning optimized against known reaction databases
\item Property prediction systems trained and validated on established chemical datasets
\item Literature analysis tools refined using benchmark chemical papers and abstracts
\item Safety assessment frameworks tuned against standardized hazard classification systems \citep{chemsafetybench2024}
\end{itemize}

The ChemSafetyBench study \citep{chemsafetybench2024} reveals that LLMs can achieve high performance on chemical safety benchmarks while still exhibiting dangerous failure modes when faced with novel chemical scenarios. This suggests that benchmark optimization may create false confidence in chemical reasoning capabilities.

\textbf{Reproducibility Standards in Chemistry}: The chemistry community has established rigorous standards for experimental reproducibility, but these standards have not yet been adapted for AI-assisted chemical research. Unlike traditional chemistry experiments, where failed replications are clearly identifiable, overfitted AI systems may produce plausible but incorrect results that are difficult to detect without expert domain knowledge.

\subsection{Physics and Engineering Applications}

Physics applications face similar challenges, with benchmarks like GPQA targeting graduate-level physics problems \citep{openai2024learning}. The CURIE benchmark \citep{google2024curie} represents a more comprehensive approach, encompassing tasks across materials science, theoretical physics, and quantum computing. However, even these sophisticated benchmarks risk becoming optimization targets.

\textbf{Engineering Workflow Examples}: The FEABench study demonstrates that LLMs struggle with end-to-end engineering modeling problems, with tested systems unable to correctly solve any of the 15 manually verified finite element analysis problems. This suggests that current benchmark optimization may not translate to genuine engineering reasoning capabilities.

\textbf{Multi-Domain Scientific Reasoning}: Scientific applications often require reasoning across multiple domains simultaneously. The SciKnowEval benchmark \citep{sciknoweval2024} attempts to address this by evaluating five progressive levels of scientific knowledge. However, systems optimized against such benchmarks may develop sophisticated pattern-matching capabilities that fail when confronted with truly novel interdisciplinary problems.

\subsection{Biomedical and Healthcare Applications}

Healthcare applications represent the highest-stakes domain for AI system reliability. Biomedical LLM workflows increasingly target specialized benchmarks for clinical reasoning, drug discovery, and diagnostic assistance. The ClinicBench framework evaluates clinical tasks including treatment recommendation and patient education generation.

\textbf{Critical Safety Implications}: In healthcare contexts, overfitted performance can have severe consequences:
\begin{itemize}
\item Diagnostic systems optimized against medical benchmark datasets may miss novel disease presentations
\item Drug discovery workflows trained on established chemical databases may fail to identify truly innovative therapeutic approaches
\item Clinical decision support systems refined using benchmark cases may provide inappropriate recommendations for edge cases
\end{itemize}

\textbf{Regulatory and Compliance Challenges}: Healthcare AI systems must meet strict regulatory standards for safety and efficacy. However, current evaluation practices may not adequately distinguish between genuine clinical reasoning and sophisticated pattern matching against medical benchmarks.

\section{Community Response and Implementation Roadmap}

\subsection{Conference and Journal Policy Implementation}

The machine learning community has begun recognizing reproducibility challenges, with major conferences implementing evaluation standards. NeurIPS and ICML now require reproducibility checklists \citep{pineau2021improving}, while ICLR encourages reproducibility statements \citep{desai2025reproducibility}. However, these initiatives primarily focus on computational reproducibility rather than the broader overfitting challenges we identify.

\textbf{Current Limitations of Existing Standards}: Analysis of NLP conference reproducibility checklists reveals that compliance rates have stagnated around 22\% for key items like hyperparameter reporting \citep{roegiest2023reproducibility}. More critically, existing checklists do not address the fundamental issue of workflow optimization against evaluation benchmarks.

\textbf{Proposed Policy Extensions}: Building on existing frameworks, we propose that conferences and journals require:
\begin{itemize}
\item \textbf{Development History Documentation}: Mandatory reporting of all datasets accessed during workflow development, including prompt engineering phases
\item \textbf{Benchmark Interaction Disclosure}: Clear documentation of how many optimization iterations were performed against each evaluation benchmark
\item \textbf{Independent Evaluation Requirements}: Submission of results on truly held-out datasets that were not accessed during development
\item \textbf{Contamination Auditing}: Systematic analysis of potential overlap between development resources and evaluation data
\end{itemize}

\subsection{Implementation Timeline and Milestones}

\textbf{Phase 1 (Immediate to 6 months)}: Community awareness and initial policy development
\begin{itemize}
\item Conference organizers establish working groups on workflow evaluation standards
\item Initial pilot programs for expanded reproducibility requirements at select venues
\item Development of standardized contamination detection tools and methodologies
\end{itemize}

\textbf{Phase 2 (6--18 months)}: Policy rollout and infrastructure development
\begin{itemize}
\item Major conferences implement extended reproducibility checklists addressing workflow overfitting
\item Development of shared holdout evaluation resources managed by third-party organizations
\item Training programs for reviewers on identifying workflow overfitting indicators
\end{itemize}

\textbf{Phase 3 (18--36 months)}: Full adoption and ecosystem transformation
\begin{itemize}
\item Industry adoption of academic evaluation standards for commercial LLM applications
\item Integration of overfitting detection into automated evaluation pipelines
\item Establishment of certification programs for workflow evaluation practices
\end{itemize}

\subsection{Addressing Implementation Challenges}

\textbf{Resistance and Practical Barriers}: Implementation will face several challenges:
\begin{itemize}
\item \textbf{Increased Development Costs}: Rigorous evaluation protocols require additional time and computational resources
\item \textbf{Competitive Pressures}: Research teams may resist transparency requirements that reveal development strategies
\item \textbf{Technical Complexity}: Detecting subtle forms of contamination requires sophisticated analysis tools
\item \textbf{Cultural Resistance}: The ``publish or perish'' academic culture incentivizes rapid publication over methodological rigor \citep{semmelrock2025reproducibility}
\end{itemize}

\textbf{Incentive Alignment Strategies}: Several measures could help offset these barriers:
\begin{itemize}
\item \textbf{Recognition Programs}: Establish awards and recognition for methodologically rigorous research that demonstrates genuine generalization
\item \textbf{Funding Priorities}: Research funding agencies should prioritize grants that include robust evaluation methodology components
\item \textbf{Career Advancement}: Academic institutions should recognize methodological contributions in promotion and hiring decisions
\item \textbf{Industry Partnerships}: Collaborate with industry to establish evaluation standards that benefit both academic research and commercial applications
\end{itemize}

\subsection{Success Metrics and Monitoring}

To measure the success of these implementation efforts, the community should track:
\begin{itemize}
\item \textbf{Compliance Rates}: Percentage of papers meeting extended reproducibility requirements
\item \textbf{Replication Success}: Independent reproduction rates for published workflow results
\item \textbf{Performance Gaps}: Differences between benchmark performance and deployment scenarios
\item \textbf{Community Adoption}: Uptake of evaluation standards across different research institutions and geographic regions
\end{itemize}
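
As one possible way to operationalize the performance-gap metric listed above, consider the following sketch. Let $\hat{p}_{\mathrm{dev}}$ denote accuracy on the benchmark used during development ($n_{\mathrm{dev}}$ items) and $\hat{p}_{\mathrm{held}}$ denote accuracy on an independent held-out set ($n_{\mathrm{held}}$ items). These symbols, and the two-standard-error rule of thumb below, are illustrative assumptions rather than an established standard.
\[
\Delta = \hat{p}_{\mathrm{dev}} - \hat{p}_{\mathrm{held}}, \qquad
\mathrm{SE}(\Delta) = \sqrt{\frac{\hat{p}_{\mathrm{dev}}\,(1-\hat{p}_{\mathrm{dev}})}{n_{\mathrm{dev}}} + \frac{\hat{p}_{\mathrm{held}}\,(1-\hat{p}_{\mathrm{held}})}{n_{\mathrm{held}}}} .
\]
For hypothetical values $\hat{p}_{\mathrm{dev}} = 0.80$, $\hat{p}_{\mathrm{held}} = 0.70$, and $n_{\mathrm{dev}} = n_{\mathrm{held}} = 500$, this gives $\Delta = 0.10$ and $\mathrm{SE}(\Delta) = \sqrt{0.00032 + 0.00042} \approx 0.027$, so the gap is about $3.7$ standard errors and is unlikely to be sampling noise alone. A gap within roughly two standard errors would not by itself indicate overfitting, and a large gap can also reflect a difference in difficulty between the two sets, so the statistic is a prompt for further investigation rather than a verdict.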

Regular community surveys and systematic replication studies should monitor progress toward addressing the overfitting crisis. The goal is not to eliminate all evaluation challenges but to ensure that reported advances represent genuine progress in LLM capabilities rather than sophisticated optimization against known benchmarks.

\subsection{AutoGPT: Autonomous Agent Development}

AutoGPT \citep{zhang2023autogpt} represents one of the first widely accessible autonomous agent frameworks, showcasing the challenges of evaluating complex agentic systems. Released in March 2023, AutoGPT quickly gained popularity but also demonstrated the evaluation challenges inherent in autonomous systems.

\textbf{Development and Benchmarking}: AutoGPT's development involved extensive community experimentation across various tasks, from software development to market research. The AutoGPT project established its own benchmarking framework \citep{autogpt2023benchmark}, designed to evaluate agent performance across categories like code generation, retrieval, memory, and safety.

However, this benchmarking approach faces the same overfitting challenges we identify. The benchmark tasks were developed concurrently with the agent framework, and community contributions often focused on improving performance on these specific evaluation criteria.

\textbf{Community-Driven Overfitting}: The open-source nature of AutoGPT created a unique form of distributed overfitting. As community members shared successful prompting strategies and workflow patterns, these approaches became optimized for the types of tasks commonly discussed and evaluated within the AutoGPT ecosystem.

\textbf{Real-World Performance Gaps}: Despite impressive performance on community benchmarks, AutoGPT users frequently reported significant performance gaps in real-world deployment scenarios. The framework's tendency toward infinite loops and inability to handle novel scenarios suggest that benchmark performance may not reflect genuine autonomous capabilities.

\subsection{Scientific Workflow Development}

LLM workflows in scientific domains present particularly concerning examples of potential overfitting, given the high stakes of scientific accuracy and reproducibility.

\textbf{Chemistry and Biology Applications}: Recent work on scientific reasoning systems often involves iterative development against established scientific benchmarks. Teams develop multi-step reasoning workflows, optimize tool usage for literature search and data analysis, and refine agent communication patterns based on performance on scientific datasets.

However, many of these scientific benchmarks contain problems that have been widely studied and discussed in the scientific literature. LLMs trained on scientific corpora may have indirect exposure to these problems, and workflow optimization against such benchmarks may not reflect genuine scientific reasoning capabilities.

\textbf{Evaluation Challenges}: Scientific applications demand higher standards of accuracy and reliability than many other domains. Yet the evaluation practices for scientific LLM workflows often mirror those used in general-purpose AI research, potentially leading to overconfidence in systems that have been optimized against known scientific problems rather than tested on truly novel research challenges.

\section{Comprehensive Framework for Addressing Workflow Overfitting}

Building on established ML evaluation best practices \citep{chandrasekaran2023test} and modern ML operations frameworks \citep{mlops2024crisp}, we propose a comprehensive approach to address workflow overfitting that extends beyond traditional model evaluation to encompass the entire development ecosystem.

\subsection{Data Separation Protocols}

\textbf{Temporal Isolation}: Establish evaluation benchmarks using data that postdates the workflow development period. This rules out direct optimization against the evaluation data and limits indirect optimization through community knowledge transfer.

\textbf{Domain Isolation}: Develop workflows in one domain and evaluate on structurally similar but content-distinct domains. For example, develop reasoning workflows on history datasets and evaluate on scientific reasoning tasks with similar logical structures but different content areas.

\textbf{Multi-Level Holdouts}: Implement nested holdout strategies where:
\begin{itemize}
\item Level 1: Component-level evaluation using standard ML holdout practices
\item Level 2: Workflow-level evaluation on integration benchmarks never used during development
\item Level 3: System-level evaluation on deployment scenarios that differ systematically from development contexts
\end{itemize}

\subsection{Methodology Transparency Requirements}

\textbf{Development History Documentation}: Require comprehensive documentation of:
\begin{itemize}
\item All datasets accessed during development, including for prompt engineering
\item Number of iterations performed against each benchmark
\item Specific optimization targets that guided design decisions
\item Community knowledge and published work consulted during development
\end{itemize}
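
The iteration count matters because it changes how much uncertainty a reported score carries. As a hedged sketch, assume that $k$ candidate workflows are fixed in advance and that each is scored on the same $n$ independently sampled benchmark items, with true accuracies $p_i$ and observed accuracies $\hat{p}_i$ (all notation introduced here for illustration). Hoeffding's inequality combined with a union bound then gives, with probability at least $1-\delta$,
\[
\max_{1 \le i \le k} \bigl|\hat{p}_i - p_i\bigr| \;\le\; \sqrt{\frac{\ln(2k/\delta)}{2n}} .
\]
For illustrative values $n = 1000$ and $\delta = 0.05$, the half-width is about $0.043$ when $k = 1$ and about $0.058$ when $k = 20$. When candidates are instead chosen adaptively in response to earlier benchmark scores, this bound need not hold and the uncertainty can be larger, which is one reason a reported result is hard to interpret without knowing $k$.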

\textbf{Contamination Auditing}: Establish systematic procedures for identifying potential contamination sources:
\begin{itemize}
\item Analysis of training data overlap with evaluation sets
\item Documentation of published work that may have influenced design decisions
\item Assessment of community knowledge transfer through forums, papers, and code repositories
\end{itemize}

\textbf{Preregistration Protocols}: Adapt preregistration practices from experimental psychology and medical research to LLM workflow development:
\begin{itemize}
\item Declare evaluation benchmarks before beginning development
\item Specify success criteria and evaluation protocols in advance
\item Commit to reporting results regardless of outcome
\end{itemize}

\subsection{Community Infrastructure}

\textbf{Benchmark Lifecycle Management}: Establish systematic protocols for benchmark retirement and replacement:
\begin{itemize}
\item Monitor community optimization pressure on specific benchmarks
\item Implement automatic retirement triggers based on performance saturation
\item Develop systematic approaches for creating replacement benchmarks that maintain evaluation validity
\end{itemize}

\textbf{Shared Holdout Resources}: Create community-managed evaluation infrastructure:
\begin{itemize}
\item Third-party evaluation services that maintain truly independent benchmarks
\item Collaborative evaluation protocols that prevent individual teams from accessing evaluation data during development
\item Standardized reporting formats that enable meta-analysis across different workflow approaches
\end{itemize}

\textbf{Incentive Alignment}: Restructure research incentives to reward methodological rigor:
\begin{itemize}
\item Conference and journal policies that require methodology transparency
\item Recognition for negative results and replication studies
\item Funding priorities that emphasize evaluation methodology alongside technical innovation
\end{itemize}

\section{Why This Matters for Scientific Applications}

The overfitting problem is particularly concerning for scientific applications. Scientific research demands reproducible results, but if LLM workflows are optimized against the same benchmarks they are evaluated on, reported performance may not generalize to real scientific problems.

Overfitted workflows may perform well on known benchmarks but fail on novel scientific challenges, leading to misplaced confidence in AI capabilities for scientific discovery. If reported performance is inflated due to overfitting, research funding and effort may be misdirected toward approaches that do not actually advance scientific capability.

The stakes are particularly high in scientific domains where accuracy and reliability are paramount, and where the cost of false confidence can impede genuine scientific progress.



\section{Discussion and Implications}

Unlike traditional ML overfitting, which typically affected single models, LLM workflow overfitting can contaminate entire research directions. When the community collectively optimizes against the same benchmarks, the cumulative effect creates systematic bias across the field.

Current practice conflates innovation (developing better workflows) with evaluation (measuring progress), yet valid evaluation requires independence from the development process. As LLM workflows become more sophisticated and are deployed in critical applications, the cost of relying on overfitted performance estimates rises.

The scientific community cannot afford to repeat ML's early mistakes, especially when the applications involve high-stakes scientific discovery and decision-making.

\section{Conclusion}

The LLM research community stands at a critical juncture. The sophistication of modern workflows has obscured a fundamental methodological problem: we are systematically reporting results on data that has been used for optimization. This mirrors the overfitting crisis that plagued early machine learning research.

Unlike traditional ML, where overfitting affected individual models, LLM workflow overfitting can contaminate entire research directions and mislead the community about actual capabilities. This is particularly concerning for scientific applications, where accuracy and reliability are paramount.

The solution requires collective action: establishing strict data separation protocols, demanding transparency in development processes, and creating truly independent evaluation resources. The ML community learned these lessons decades ago. The LLM community must learn them now, before the credibility of AI research is further undermined.

We call on the research community to adopt rigorous evaluation standards that separate development from testing, just as was necessary in traditional ML. Only through such methodological rigor can we ensure that reported advances represent genuine progress rather than sophisticated forms of overfitting.

\bibliographystyle{plainnat}
\bibliography{overfitting_llm}

\newpage

\section*{Agents4Science AI Involvement Checklist}

\begin{enumerate}
    \item \textbf{Hypothesis development}: The research topic and hypothesis that LLM workflows suffer from systematic overfitting similar to early ML practices was jointly developed, with the AI agent providing the detailed conceptual framework and historical parallels.

    Answer: \textbf{[B] Mostly human, assisted by AI}
    
    Explanation: The human co-author identified the core problem and made the connection to historical ML overfitting. The AI agent developed the detailed hypothesis, literature connections, and theoretical framework for understanding the problem.

    \item \textbf{Experimental design and implementation}: This is a position paper that does not include empirical experiments. The analysis relies on synthesizing existing literature and making methodological arguments.

    Answer: \textbf{[NA] Not Applicable}
    
    Explanation: No experiments were conducted. The paper makes its argument through literature analysis, theoretical reasoning, and methodological critique of current practices.

    \item \textbf{Analysis of data and interpretation of results}: The paper analyzes existing literature and research practices rather than experimental data. The AI agent conducted literature search and synthesis.

    Answer: \textbf{[C] Mostly AI, assisted by human}
    
    Explanation: The AI agent conducted web searches for relevant literature, identified key papers on data contamination and benchmark leakage, and synthesized findings to support the core arguments. The human co-author provided guidance on focus areas.

    \item \textbf{Writing}: The AI agent was responsible for the majority of the writing, including structure, arguments, and academic formatting.

    Answer: \textbf{[C] Mostly AI, assisted by human}
    
    Explanation: The AI agent wrote the complete paper, including the abstract, introduction, literature review, arguments, and conclusions. The human co-author provided feedback, guidance on emphasis areas, and editorial input.

    \item \textbf{Observed AI Limitations}: The AI agent required guidance on specific examples and emphasis areas, and needed human oversight to ensure arguments remained grounded and credible.

    Description: The AI agent occasionally needed direction on which aspects of the overfitting problem to emphasize most strongly. The human co-author's role was crucial in keeping the argument focused and ensuring it addressed the most important methodological concerns in the field.
\end{enumerate}

\newpage

\section*{Agents4Science Paper Checklist}

\begin{enumerate}

\item \textbf{Claims}
    \item[] Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
    \item[] Answer: \textbf{[Yes]}
    \item[] Justification: The abstract and introduction clearly state this is a position paper arguing that LLM workflows suffer from systematic overfitting, with claims appropriately scoped to methodological critique rather than empirical findings.

\item \textbf{Limitations}
    \item[] Question: Does the paper discuss the limitations of the work performed by the authors?
    \item[] Answer: \textbf{[Yes]}
    \item[] Justification: Section 7 discusses the scope of the argument and acknowledges this is a methodological position paper without empirical validation of proposed solutions.

\item \textbf{Theory assumptions and proofs}
    \item[] Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
    \item[] Answer: \textbf{[NA]}
    \item[] Justification: This is a position paper that makes methodological arguments rather than theoretical claims requiring formal proofs.

\item \textbf{Experimental result reproducibility}
    \item[] Question: Does the paper fully disclose all the information needed to reproduce the main experimental results?
    \item[] Answer: \textbf{[NA]}
    \item[] Justification: This paper does not include experiments. It is a methodological position paper based on literature analysis and reasoned argument.

\item \textbf{Open access to data and code}
    \item[] Question: Does the paper provide open access to the data and code?
    \item[] Answer: \textbf{[NA]}
    \item[] Justification: No experiments or code are involved in this position paper.

\item \textbf{Experimental setting/details}
    \item[] Question: Does the paper specify all training and test details necessary to understand the results?
    \item[] Answer: \textbf{[NA]}
    \item[] Justification: This paper does not include experiments.

\item \textbf{Experiment statistical significance}
    \item[] Question: Does the paper report error bars or statistical significance information?
    \item[] Answer: \textbf{[NA]}
    \item[] Justification: No experiments are conducted in this position paper.

\item \textbf{Experiments compute resources}
    \item[] Question: Does the paper provide sufficient information on computer resources needed?
    \item[] Answer: \textbf{[NA]}
    \item[] Justification: No experiments requiring computational resources were conducted.
    
\item \textbf{Code of ethics}
    \item[] Question: Does the research conform with the Agents4Science Code of Ethics?
    \item[] Answer: \textbf{[Yes]}
    \item[] Justification: This methodological critique aims to improve research practices and scientific integrity, fully conforming with ethical research standards.

\item \textbf{Broader impacts}
    \item[] Question: Does the paper discuss potential positive and negative societal impacts?
    \item[] Answer: \textbf{[Yes]}
    \item[] Justification: Section 5 discusses implications for scientific applications and the risks of methodological failures in high-stakes research domains.

\end{enumerate}

\end{document}