\section{Discussion}
\label{sec:discussion}
% These assessments are routine for human experts, but agents still fail to translate the knowledge in the literature into effective technical decisions.

\vspace{3pt} \noindent \textbf{Domain Knowledge.} Deep learning for medical imaging requires not only machine learning expertise but also a deep understanding of the underlying data. Analyses of winning competition solutions show that substantial medical domain knowledge is applied during data preprocessing, a time-consuming stage that demands visualization, parameter tuning, and strategic decision-making. For modalities such as CT, automated deep learning-based windowing methods have only recently begun to see adoption~\cite{Zhang2025InterpretableAW, Lee2018PracticalWS}. These challenges highlight why agents struggle: success in medical imaging depends on nuanced, modality-specific judgments that emerge from years of experience, not simply from applying generic machine learning workflows. Without the ability to iteratively explore the data, validate assumptions, and adjust pipelines, agents fall short of the careful engineering that expert practitioners rely on. Although medical imaging knowledge is abundant in papers, forums, and open-source code, current agents struggle to leverage it effectively. Even widely adopted frameworks like nnU-Net~\cite{Isensee2020nnUNetAS}, with over 5,000 citations, are often overlooked because agents cannot judge when they are relevant. The result is a retrieval-relevance gap: agents find methods but lack the understanding to assess whether the task is truly 3D, whether the data support that approach, or whether computational resources are sufficient.
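
As a concrete illustration of the preprocessing decisions discussed above, CT intensity windowing is commonly expressed as a clip-and-rescale operation. The following is a minimal sketch of the conventional fixed-window form, not the automated methods cited above; the symbols $x$ (voxel intensity in Hounsfield units), $\ell$ (window level), and $w$ (window width) are notation introduced here for illustration:
\[
x' = \min\!\left(\max\!\left(\frac{x - (\ell - w/2)}{w},\, 0\right),\, 1\right).
\]
Intensities below $\ell - w/2$ map to $0$, those above $\ell + w/2$ map to $1$, and the range in between is rescaled linearly. Choosing $\ell$ and $w$ for a given anatomy is the kind of modality-specific judgment that a generic pipeline does not make on its own.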


\vspace{3pt} \noindent \textbf{Beyond Domain Knowledge: The Medical Engineering Gap.}
The agents we evaluated (AIDE, R\&D Agent, and ML-Master) lack the medical deep learning knowledge and engineering skills necessary for medical imaging solutions. Even with the winning strategies in Table~\ref{tab:mlebench_medical_summary}, they cannot replicate the results, because they are fundamentally designed for general machine learning tasks and lack critical infrastructure for specialized formats (DICOM, NIfTI) and domain libraries (nnU-Net, MONAI, SimpleITK). Their tree search and reasoning mechanisms, effective for tabular or natural image tasks, fail to navigate 3D volumetric complexities such as patch-based training, variable voxel spacing, or clinical metrics (Dice coefficients, Hausdorff distances), producing code that either fails outright or trains clinically meaningless models. M\textsuperscript{3}Builder~\citep{feng2025m} addresses this through medical-specific templates but remains limited by predefined frameworks (nnU-Net, Transformers). Its success on M\textsuperscript{3}Bench likely reflects template-solvable problems rather than tacit engineering knowledge for edge cases, such as handling unconventional protocols or debugging clinically implausible predictions without explicit error signals.
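
For reference, the two clinical metrics named above are conventionally defined as follows. Here $P$ and $G$ denote the predicted and ground-truth segmentations as sets of voxels (or boundary points, for the Hausdorff distance) and $d$ is the Euclidean distance; this notation is introduced here for illustration only:
\[
\mathrm{Dice}(P, G) = \frac{2\,|P \cap G|}{|P| + |G|},
\qquad
d_{H}(P, G) = \max\left\{ \max_{p \in P} \min_{g \in G} d(p, g),\; \max_{g \in G} \min_{p \in P} d(p, g) \right\}.
\]
The Dice coefficient rewards volumetric overlap, whereas the Hausdorff distance penalizes the single worst boundary error, so a model tuned for one can score poorly on the other. Individual challenges may use variants, such as the 95th-percentile Hausdorff distance, so the exact definition should be taken from each competition's evaluation protocol.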



Beyond their failure to apply winning strategies, the agents also make poor use of the available compute. On an 80\,GB NVIDIA H100 GPU, the agents typically used only 10--20\% of the memory (roughly 8--16\,GB) while training their models, a significant opportunity cost that human practitioners would naturally recover through larger batch sizes, model ensembles, or hyperparameter sweeps. This inefficiency is compounded by the fact that most of these agent architectures are inherently iterative and can only work on one task at a time rather than multi-tasking as a human would. While human experts routinely run multiple experiments in parallel, monitor training curves across different model configurations, and contextualize results from concurrent trials, these agents follow strictly sequential workflows that dramatically extend development time and limit their ability to explore the solution space efficiently within fixed time budgets.

\vspace{3pt} \noindent \textbf{Implications for Autonomous Scientific Research.}
While this work focuses on medical imaging, the failure modes we identify (retrieval-relevance gaps, domain-specific engineering deficits, and inability to apply documented best practices) extend to other expert scientific domains~\cite{schmidgall2024agentclinic, xie2024travelplanner, mitchener2025bixbench}. Although agentic frameworks grounded in scientific insight have been developed~\cite{ding2025scitoolagent, jansen2024discoveryworld}, current benchmarks primarily measure engineering competence rather than domain expertise. Our capability analysis in Figure~\ref{fig:capabilities} reveals that agents fail to demonstrate fundamental scientific practices identified in winning solutions: systematic failure case analysis, domain-appropriate preprocessing strategies, and metric-aligned model design. These competencies emerge from years of iterative experience: understanding which methods transfer across problem instances, distinguishing data artifacts from true signal, and knowing which engineering shortcuts compromise clinical validity. This gap has critical implications for autonomous agents in high-stakes domains like drug discovery, materials science, and genomics, where incorrect modeling decisions waste experimental resources and may impact patient safety. Current agents' inability to recognize when they lack domain knowledge, evident in confident but meaningless submissions, suggests they cannot reliably self-assess the boundaries of their competence, a prerequisite for safe autonomous operation. Achieving genuine scientific autonomy requires architectural innovations beyond scaling: mechanisms for domain-specific reasoning rather than retrieval-based pattern matching, and competence self-awareness to identify when human expertise is required.


\vspace{3pt} \noindent \textbf{Bridging the Gap.}
Several actionable solutions could address the current limitations of autonomous ML agents in medical imaging. First, developing domain-specialized agent architectures with native support for medical formats (DICOM/NIfTI) and pre-integrated frameworks (nnU-Net, MONAI) would reduce the engineering burden of handling specialized data. Second, implementing parallel experimentation capabilities would allow agents to run multiple configurations simultaneously, making better use of GPU resources (currently only 10--20\% of memory is used) and mirroring human expert workflows. Third, incorporating competence self-assessment mechanisms that flag unfamiliar modalities or request human verification would prevent confident but meaningless submissions. Fourth, enhancing retrieval-relevance systems with medical imaging-specific semantic understanding would help agents judge method applicability beyond simple pattern matching. Finally, domain-specific fine-tuning on medical imaging literature and competition solutions could help foundation models develop the tacit engineering knowledge currently accessible only through years of human experience.

\vspace{3pt} \noindent \textbf{Limitations.} Our work has several limitations. First, the standardized computational resources (H100 GPUs, 24-hour time budget) may not reflect all research environments. Second, potential data leakage exists if foundation models were pre-trained on competition solutions, though any such knowledge did not translate into successful implementations. Third, our automated capability assessment provides scalability but lacks the depth of manual expert analysis, which would be prohibitively labor-intensive. Finally, the ablation studies (Tables~\ref{tab:3_day_performance_split} and~\ref{tab:agent_performance_models_split}) evaluated only four representative challenges due to computational constraints.