\begin{figure}[htb]
    \centering
    \includegraphics[width=0.8\columnwidth]{figures/figure1_v3.png}
    \vspace{-10pt}
    \caption{Performance of SOTA AI coding agents on ReX-MLE.} 
    % All agents show substantially lower average scores compared to documented human competition winners.}
    \label{fig:figure1}
\end{figure}

\clearpage

\section{Introduction}
% \textcolor{red}{Keep intro within 1 page.}
Recent advances in large language models (LLMs) have enabled autonomous coding agents capable of solving standard machine learning engineering (MLE) and software engineering tasks~\cite{liu2025ml,chan2024mle,yang2025rdagentllmagentframeworkautonomous}. These frameworks indicate the potential for agents to operate as autonomous AI scientists in the future~\cite{gottweis2025towards}. However, despite their success on general benchmarks, current agents break down when faced with complex, domain-specific scientific challenges~\cite{zhu2025ai}. Medical imaging represents a particularly demanding domain that exposes these limitations. 

Unlike standard computer vision benchmarks, medical imaging tasks involve high-dimensional and heterogeneous modalities, including 3D CT and MRI volumes, multi-parametric scans, and gigapixel pathology slides, which require specialized preprocessing, normalization, and augmentation strategies~\cite{shen2017deep}. Choosing these strategies well typically requires substantial hands-on experience with the data. Training the resulting models often involves multi-stage pipelines, long training cycles, and careful hyperparameter tuning, all of which demand meticulous workflow management~\cite{eisenmann2023winner}.
% . Addressing these challenges typically requires deep domain expertise and meticulous computational management~\cite{eisenmann2023winner}.

Despite rapid progress in agent capabilities, existing evaluation frameworks remain poorly aligned with real-world scientific requirements. Current coding benchmarks~\cite{jimenez2023swe, chan2024mle, tang2023ml, padigela2025ml} tend to emphasize code generation or simplified ML pipeline construction, but overlook
% the dimensions that determine success in medical imaging: evaluation using 
competition-grade metrics, assessment of actual model outputs under realistic training constraints, adherence to domain-expert preprocessing and validation standards, and comparison against documented winning strategies. 
Lacking these elements, prior benchmarks fail to surface the behaviors that prevent current agents from succeeding on real medical imaging tasks, such as early training termination, mismanagement of computational budgets, flawed preprocessing pipelines, and invalid evaluation procedures. 
% In practice, agents frequently terminate training early, mismanage limited computational budgets, produce flawed preprocessing pipelines, and adopt invalid evaluation procedures.
% Those failure modes remain invisible under existing benchmark designs.

To address these gaps, we construct \textbf{ReX-MLE}, a benchmark comprising \textbf{20} diverse challenge tasks derived from \textbf{10} high-impact medical imaging competitions. ReX-MLE spans \textbf{8} imaging modalities and multiple task types, including segmentation, detection, classification, image quality assessment, and generative enhancement.
Within this environment, agents must independently handle the full scientific workflow, from preprocessing and model design to training and evaluation, and must produce complete, submission-ready predictions (e.g., NIfTI volumes, masks, and JSON detection files) rather than single textual responses.

% To address these gaps, we introduce ReX-MLE, a benchmark comprised of 20 diverse challenge tasks derived from 10 high‑impact medical imaging competitions spanning segmentation, detection, classification, and generation. 
% The benchmark spans 10 different imaging modalities, including CT, MRI, ultrasound, and digital pathology, requiring agents to process high-dimensional, heterogeneous data rather than simplified toy datasets.
% ReX-MLE evaluates agents in a realistic, end‑to‑end setting in which they must independently handle preprocessing, model design, training, and evaluation. 
% Unlike other benchmarks, agents are required to generate full submission‑ready predictions (e.g., NIfTI volumes, segmentation masks, and JSON detection files) rather than a single textual response. 

% Our results reveal large performance gaps between state-of-the-art agents and human experts. As shown in Figure~\ref{fig:figure1}, even leading systems, including ML-Master and AIDE, utilizing SOTA API models (GPT-5, Gemini 3 Pro, Claude 4.5 Sonnet), consistently rank in the lowest percentiles relative to human competition winners. These findings highlight the need for benchmarks that reflect real scientific workflows and underscore the importance of tackling domain-specific obstacles to achieve truly autonomous scientific agents.

Our main contributions are: (i) We introduce ReX-MLE, a comprehensive benchmark for evaluating autonomous AI agents on domain-specific scientific challenges, derived from high-impact medical imaging competitions spanning segmentation, detection, classification, and generation.
(ii) We establish a rigorous evaluation protocol that assesses agents' abilities to generate full, submission-ready prediction files under strict computational and time constraints, mirroring the requirements of real-world scientific discovery.
(iii) We conduct an extensive evaluation of leading autonomous systems, including ML-Master and AIDE, using state-of-the-art foundation models (GPT-5, Gemini 3 Pro, and Claude 4.5 Sonnet). As shown in Figure~\ref{fig:figure1}, \textbf{our results reveal a large performance gap, with even the most advanced agents consistently ranking in the lowest percentiles relative to human experts, highlighting critical gaps in domain-specific engineering and process validity.}



% While autonomous AI agents demonstrate the potential to autonomously conduct experiments, explore solution spaces through systematic search, and even perform end-to-end machine learning engineering~\citep{liu2025ml}, we find that they are still unable to solve complex, specialized challenges. 

% \textbf{Autonomous AI agents are reshaping the automation of scientific research~\citep{gottweis2025towards}.} 
% Recent systems demonstrate the potential to autonomously conduct experiments, explore solution spaces through systematic search, and even perform end-to-end machine learning engineering~\citep{liu2025ml}. 
% This vision has materialized most concretely in MLE-bench~\citep{chan2024mle}, a challenging benchmark comprising 75 Kaggle competitions spanning diverse domains, where state-of-the-art agents such as AIDE~\citep{schmidt2024aide}, ML-Master~\citep{liu2025ml}, and AIRA~\citep{toledo2025ai} achieve medal rates of 16.9\%, 47.7\%, and 43.0\%, respectively. These results suggest that autonomous systems may soon approach human-level proficiency in general ML workflows.



% \begin{figure}[htbp]
%     \centering
%     \includegraphics[width=0.95\columnwidth]{figures/AgentFigures.drawio.png}
%     \caption{Contrasting timelines and resources in medical imaging competition workflows. Human competitors work over 3 months with access to medical literature, domain techniques, previous competitions, systematic architecture exploration, pretrained models, community knowledge, ensembling, and iterative leaderboard feedback. AI agents compress this into 24 hours with only a subset of these capabilities: parsing tasks, loading data, training models, and debugging, without domain knowledge sources, specialized tools, or iterative refinement.}
%     \label{fig:workflows}
% \end{figure}


% However, a critical question remains: does this general engineering competence translate to specialized scientific domains? We find that when these same agents are subjected to the rigorous demands of medical imaging, their capabilities degrade dramatically. Unlike standard computer vision tasks often found in general benchmarks, medical imaging challenges require multimodal reasoning across 3D data formats (e.g., NIfTI, DICOM), management of cross-institutional variability, and the optimization of clinically constrained metrics.
% As illustrated in Figure~\ref{fig:workflows}, human competitors invest months into workflows that integrate literature review, exploration of 15–20 domain-specific model architectures, and careful refinement using clinical performance metrics. In contrast, current agents compress this entire process into a single 24-hour loop focused primarily on loading data, executing code, and producing one inference output, without access to medical knowledge sources or specialized tooling.


% Existing ML-agent benchmarks are insufficient to reveal these limitations. Benchmarks like MedQA or the standard MLE-Bench either focus on text-based reasoning or rely on simplified medical datasets that do not reflect the engineering reality of clinical deployment. Consequently, the field lacks a rigorous method to diagnose why agents fail when moving from generic coding tasks to specialized biomedical research.

% To address this gap, we introduce ReX-MLE, the first benchmark designed to stress-test agent performance on the full lifecycle of medical imaging research. Our contribution is threefold: \begin{itemize}
%     \item \textbf{A Rigorous Domain Benchmark.} We curate 20 challenges from 10 major Grand Challenge competitions, spanning 8 modalities (including CT, MRI, and Pathology) and diverse tasks from segmentation to reconstruction.
%     \item \textbf{Realistic Evaluation.} Unlike prior works that rely on static datasets, we evaluate agents on reproducibility and leaderboard performance using the official, clinically validated metrics from the original competitions.
%     \item \textbf{A Taxonomy of Failure.} Crucially, we move beyond simple performance scores to diagnose the root causes of failure. We systematically evaluate agent execution traces against the ``13 Winning Strategies'' identified by Eisenmann et al. in their analysis of top human competitors.
% \end{itemize}

% By combining quantitative leaderboards with this qualitative strategy analysis, we reveal that current agents operate as competent coders but poor researchers, mastering the syntax of machine learning while missing the semantic ``domain priors'' (such as data curation and domain-specific preprocessing) required for scientific success.





% AI agents are reshaping the automation of scientific research~\citep{}. Recent systems demonstrate the potential to autonomously conduct experiments~\citep{team2025internagent}, explore solution spaces through systematic search~\citep{toledo2025ai}, and even perform end-to-end machine learning engineering~\citep{liu2025ml}. This vision has materialized most concretely in MLE-bench~\citep{chan2024mle},a challenging benchmark comprising 75 Kaggle competitions spanning diverse domains, where state-of-the-art agents such as AIDE~\citep{schmidt2024aide}, ML-Master~\citep{liu2025ml}, and AIRA~\citep{toledo2025ai} have demonstrated impressive capabilities. 

%  Despite these promising results, a large implementation gap remains when agents face more complex challenges. This gap becomes particularly apparent in medical imaging competitions~\citep{}, which demand comprehensive knowledge of image processing techniques and pose significantly greater task difficulty. Here, the same state-of-the-art agents achieve medal rates far below both their MLE-bench performance and human competition winners. This dramatic disparity raises a fundamental question: \textit{Why do AI agents that excel at automating machine learning research fail so dramatically on medical imaging tasks?}

% Within just two years, these systems have progressed from an initial 16.9\% medal rate~\citep{chan2024mle} to 47.7\%~\citep{toledo2025ai} on a challenging benchmark comprising 75 Kaggle competitions spanning diverse domains, autonomously navigating complex ML pipelines from data exploration to model optimization.

% Despite these promising results, a large implementation gap remains when agents face more complex challenges. This gap becomes particularly apparent in medical imaging competitions~\citep{}, which demand comprehensive knowledge of image processing techniques and pose significantly greater task difficulty. Here, the same state-of-the-art agents achieve medal rates far below both their MLE-bench performance and human competition winners. This dramatic disparity raises a fundamental question: \textit{Why do AI agents that excel at automating machine learning research fail so dramatically on medical imaging tasks?}

% Understanding this failure is critical for developing AI agents capable of real-world scientific automation. Medical imaging represents a high-stakes application domain where automated ML could accelerate diagnostic algorithm development~\citep{} and improve healthcare outcomes~\citep{}. Moreover, the structured nature of competitions~\citep{}, with clearly defined tasks, evaluation metrics, and training data, provides an ideal controlled environment to systematically identify which specific capabilities current agents lack.


% Prior work has documented the strategies that lead to success in medical imaging competitions~\citep{eisenmann2023winner}, revealing common patterns in data preprocessing, architecture choices, and ensemble methods. Yet which of these capabilities are absent from current AI agents remains unclear. Do the failures stem from domain-specific knowledge gaps? Architectural limitations? Inadequate exploration strategies? Or something else entirely?

% \textbf{In this work}, we systematically investigate why state-of-the-art AI agents fail at medical imaging challenges. We evaluate four leading agents including AIDE, ML-Master, M3Builder, and InternAgent, across a curated benchmark of medical imaging competitions spanning segmentation, classification, and detection tasks. Through detailed analysis of their execution traces and failure patterns, we identify the specific capability gaps that separate these agents from successful human competitors. Our contributions are:
% \begin{itemize}[leftmargin=1cm]
%     \item A benchmark of medical imaging challenges adapted from prominent competitions (PUMA~\citep{}, TopCoW~\citep{}, TopBrain~\citep{}, DENTEX~\citep{}, etc.) designed to evaluate AI agent capabilities on domain-specific tasks
%     \item Systematic evaluation of four leading AI agents documenting their failure modes and performance gaps compared to human winners
%     \item A taxonomy of missing capabilities that prevent current agents from succeeding on medical imaging tasks
%     \item Evidence-based insights for developing more capable AI agents that can handle domain-specific scientific challenges
% \end{itemize}

% Our findings reveal that the gap is not merely about domain knowledge. Rather, it reflects fundamental limitations in how current agents approach problem-solving, handle domain constraints, and leverage specialized techniques. By characterizing these gaps, we provide a roadmap for building AI agents capable of true scientific automation across diverse domains.




