\section{Results}
\label{sec:results}

\subsection{Performance on the Medical Subset of MLE-Bench}



\begin{table*}[t]
\centering
\footnotesize	
\caption{Performance of autonomous ML agents on the medical subset of MLE-Bench. 
Each challenge is evaluated using its primary leaderboard metric ($\uparrow$ indicates higher is better; $\downarrow$ indicates lower is better). 
Values are reported as raw metric scores; ``w/o sol.'' and ``w/ sol.'' denote runs without and with access to the first-place solution report, and ``FAIL'' denotes a run without a valid submission. 
\textbf{Bold} text denotes the best agent score for each challenge.}
\label{tab:mlebench_medical_summary}
\begin{tabular}{lcccccccc}
\toprule
\textbf{Challenge} &
\textbf{Metric} &
\textbf{Human} &
\multicolumn{2}{c}{\textbf{AIDE}} &
\multicolumn{2}{c}{\textbf{ML-Master}} &
\multicolumn{2}{c}{\textbf{R\&D-Agent}} \\
\cmidrule(lr){4-5} \cmidrule(lr){6-7} \cmidrule(lr){8-9}
& & & w/o sol. & w/ sol. & w/o sol. & w/ sol. & w/o sol. & w/ sol. \\
\midrule
HCD & AUROC $\uparrow$ & 0.983 & 0.988 & 0.989 &  0.990 & 0.992 & 0.981 & \textbf{0.993} \\
RANZCR & AUROC $\uparrow$ & 0.973 & 0.807 & 0.910  & 0.880  & 0.859  & \textbf{0.932} & 0.928 \\
ISIC & AUROC $\uparrow$ & 0.945 & 0.879 & \textbf{0.881}  & 0.845  & 0.808 & 0.787 & 0.791 \\
HuBMAP & DICE $\uparrow$& 0.948 & \textbf{0.133} & 0.000  & 0.000  & 0.000 & 0.008 & 0.000 \\
OSIC & LLL $\uparrow$& -6.841 & -7.999 & -8.915  & \textbf{-7.389}  & -8.191 & \textcolor{red}{FAIL} & -9.351 \\
UWGI & DICE/Haus. $\uparrow$& 0.879 & 0.366 & 0.000  & 0.000  & 0.000 & \textbf{0.580} & 0.509 \\
RSNA-Spine & Combo $\downarrow$& 0.276 & \textbf{0.563} & 0.761  & 0.600  & 0.675 & 4.959 & 0.656 \\
RSNA-BC & F1 $\uparrow$& 0.490 & 0.056 & 0.048  & 0.028  & \textcolor{red}{FAIL} & 0.043 & \textbf{0.120} \\
RSNA-RG & AUROC $\uparrow$& 0.600 & 0.454 & \textbf{0.547}  & 0.500  & 0.462 & 0.540 & 0.480 \\
SIIM-COVID & mAP $\uparrow$& 0.623 & 0.236 & \textbf{0.393}  &  \textcolor{red}{FAIL}  & 0.305 &\textcolor{red}{FAIL} & \textcolor{red}{FAIL} \\
VinBig & COCO $\uparrow$& 0.289 & \textcolor{red}{FAIL} & \textcolor{red}{FAIL}  & \textcolor{red}{FAIL}  & \textcolor{red}{FAIL} & \textbf{0.116} & \textcolor{red}{FAIL} \\
\bottomrule
\end{tabular}
\end{table*}

To establish a performance baseline, we first evaluated AIDE, ML-Master, and R\&D-Agent on the medical subset of the original MLE-Bench. Table~\ref{tab:mlebench_medical_summary} shows that agents perform well on simpler tasks such as the HCD (Histopathologic Cancer Detection) challenge, 
where ML-Master achieved an AUROC of 0.992, surpassing the human gold standard of 0.983. 
However, on more complex medical imaging tasks, especially those involving segmentation or detection, performance deteriorated sharply.
On the segmentation challenges (HuBMAP, UWGI), most runs produced zero or near-zero Dice scores despite correct training scripts, and on the detection challenges (SIIM-COVID, VinBig), most runs failed to produce a valid submission, indicating fundamental deficiencies in handling high-dimensional medical image data.

We conducted an additional experiment on the medical subset of MLE-Bench in which we provided the agents with the solution reports written by the first-place team of each challenge, excluding HCD, for which no report was available. These reports described the methodology used by the winning teams but did not include the full training or evaluation code. Table~\ref{tab:mlebench_medical_summary} shows that the performance gap remains even when agents are explicitly given the top-performing strategy (``w/ sol.''); in several cases the score drops, for example for AIDE on HuBMAP (0.133 to 0.000) and on UWGI (0.366 to 0.000). The inability of agents to reproduce expert-level performance, even when provided with the solution, suggests that the limitation goes beyond domain knowledge or hypothesis generation and reflects a fundamental inability to carry out the engineering work needed to adapt, debug, and deploy medical imaging pipelines.

\subsection{Quantitative Performance Gap on ReX-MLE}

\begin{table*}[t]
\centering
\footnotesize	
\caption{Agent performance across the ReX-MLE suite with primary metric values and percentile ranks (Competition Rank) separated.}
\label{tab:agent_performance_full_split}
\begin{tabular}{l l 
                c c 
                c c 
                c c 
                c}
\toprule
\textbf{Challenge} & \textbf{Metric ($\uparrow$)} 
& \multicolumn{2}{c}{\textbf{AIDE}} 
& \multicolumn{2}{c}{\textbf{ML-Master}} 
& \multicolumn{2}{c}{\textbf{R\&D-Agent}} 
& \textbf{Human} \\
\cmidrule(lr){3-4}\cmidrule(lr){5-6}\cmidrule(lr){7-8}
& & Value & Rank & Value & Rank & Value & Rank & Value \\
\midrule
\multicolumn{9}{l}{\textit{\textbf{Segmentation Tasks}}} \\
ISLES'22 & Dice & 0.04 & 0\% & 0.00 & 0\%  & 0.02 & 0\% & 0.79 \\
NeurIPS-CellSeg & F1 & 0.04 & 0\% & 0.04 & 0\% & 0.36 & 0\% & 0.88 \\
PANTHER-T1 & Dice & 0.33 & 16\% & 0.13 & 8\% & 0.16 & 8\% & 0.73 \\
PANTHER-T2 & Dice & 0.09 & 10\% & 0.05 & 8\% & 0.28 & 58\% & 0.53 \\
PUMA-T1-Seg & Dice & \textcolor{red}{FAIL} & -- & 0.00 & 0\% & 0.00 & 0\% & 0.78 \\
PUMA-T2-Seg & Dice & 0.00 & 0\% & 0.00 & 0\% & 0.00 & 0\% & 0.78 \\
SEG.A & Dice & 0.02 & 0\% & 0.02 & 0\% & 0.00 & 0\% & 0.92 \\
TopBrain-CTA & Mean Dice & 0.03 & 2\% & 0.26 & 3\% & 0.08 & 2\% & 0.79 \\
TopBrain-MRA & Mean Dice & 0.01 & 10\% & 0.26 & 0\% & 0.50 & 0\% & 0.81 \\
TopCoW-CTA-Seg & Mean Dice & 0.09 & 0\% & 0.25 & 3\% & 0.49 & 2\% & 0.87 \\
TopCoW-MRA-Seg & Mean Dice & 0.11 & 0\% & 0.48 & 0\% & 0.73 & 3\% & 0.88 \\
\midrule
\multicolumn{9}{l}{\textit{\textbf{Detection Tasks}}} \\
DENTEX & AP & 0.09 & 0\% & 0.08 & 0\% & 0.09 & 0\% & 0.40 \\
PUMA-T1-Det & F1 & 0.02 & 0\% & 0.08 & 0\% & 0.06 & 0\% & 0.66 \\
PUMA-T2-Det & F1 & \textcolor{red}{FAIL} & -- & 0.00 & 0\% & 0.01 & 0\% & 0.27 \\
TopCoW-CTA-Det & IoU & 0.67 & 38\% & 0.65 & 25\% & 0.70 & 56\% & 0.79 \\
TopCoW-MRA-Det & IoU & 0.66 & 14\% & 0.69 & 14\% & 0.19 & 14\% & 0.85 \\
\midrule
\multicolumn{9}{l}{\textit{\textbf{Classification Tasks}}} \\
TopCoW-CTA-Cls & Accuracy & 0.33 & 33\% & 0.10 & 0\% & 0.28 & 50\% & 0.73 \\
TopCoW-MRA-Cls & Accuracy & 0.33 & 25\% & 0.33 & 25\% & 0.09 & 0\% & 0.89 \\
\midrule
\multicolumn{9}{l}{\textit{\textbf{Image Quality $\&$ Enhancement Tasks}}} \\
LDCT-IQA & Score & 2.62 & 33\% & 2.50 & 0\% & 2.66 & 50\% & 2.74 \\
USenhance & LNCC & 0.11 & 0\% & 0.13 & 0\% & \textcolor{red}{FAIL} & -- & 0.91 \\
\midrule
Overall Mean Percentile & -- & -- & 9.05\% & -- & 4.53\% & -- & \textbf{12.15}\% & -- \\
\bottomrule
\end{tabular}
\end{table*}




We further evaluated AIDE, ML-Master, and R\&D-Agent across all 20 challenges in ReX-MLE.
As detailed in Table~\ref{tab:agent_performance_full_split}, no agent achieved expert-level performance on any of the 20 tasks, and the majority of submissions ranked in the 0th percentile relative to human competitors.
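
\noindent As a worked check, the last row of Table~\ref{tab:agent_performance_full_split} is consistent with an unweighted mean over all 20 challenges in which a failed submission counts as the 0th percentile. This convention is an assumption used here for illustration. Under it, summing the nonzero per-challenge ranks reproduces the reported means for AIDE and R\&D-Agent:
\[
\bar{p}_{\mathrm{AIDE}} = \frac{16+10+2+10+38+14+33+25+33}{20} = \frac{181}{20} = 9.05\%,
\]
\[
\bar{p}_{\mathrm{R\&D}} = \frac{8+58+2+2+3+56+14+50+50}{20} = \frac{243}{20} = 12.15\%.
\]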

\vspace{3pt} \noindent \textbf{Segmentation Tasks.} Segmentation represents the most challenging category. In pathology segmentation (PUMA), AIDE failed to produce a valid submission on PUMA-T1-Seg, and every valid submission from the three agents received a Dice score of 0.00. These failures likely stem from agents' inability to handle gigapixel WSI data, whether via memory-efficient patching or proper preprocessing. Similar trends are observed in volumetric neurovascular tasks (TopBrain, TopCoW), where the majority of mean Dice scores remain below 0.3, whereas the winning scores range from 0.79 to 0.88. R\&D-Agent shows modest improvement on certain TopCoW and TopBrain tasks, achieving mean Dice scores of up to 0.73, though still below human performance. These results suggest difficulties in handling 3D spatial consistency, voxel spacing normalization, and robust volumetric inference.
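
\noindent For reference, the Dice coefficient between a predicted mask $P$ and a ground-truth mask $G$ takes the following standard form (individual challenges may average it over cases or classes in their own way):
\[
\mathrm{Dice}(P, G) = \frac{2\,|P \cap G|}{|P| + |G|}.
\]
A reported score of 0.00 therefore corresponds to predictions with essentially no overlap with the annotated structures, while a score of 1 corresponds to a perfect match.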

\vspace{3pt} \noindent \textbf{Detection and Classification Tasks.} Compared with segmentation, agents achieved slightly better performance on detection and classification tasks, but results remain substantially worse than those of human competitors. For example, on TopCoW-CTA-Det, R\&D-Agent reached an IoU of 0.70 (56th percentile), one of the few outcomes at or above the 50th percentile across the entire benchmark. Similarly, agents achieved moderate accuracies on the TopCoW classification tasks (e.g., 0.33 for AIDE on both CTA and MRA), although still far below the winning solutions (0.73 and 0.89, respectively). Nevertheless, the overall performance remains poor: results are inconsistent, percentile ranks often collapse to 0\%, and scores never approach competitive human baselines. Thus, even on the tasks where agents show their best results, they are still far from demonstrating reliable or clinically meaningful competence.

\vspace{3pt} \noindent \textbf{Generative and Quality Assessment.} Performance on generative tasks was notably poor: on USenhance (ultrasound enhancement), AIDE and ML-Master achieved LNCC scores of 0.11 and 0.13, respectively, against a human baseline of 0.91, and R\&D-Agent failed to produce a valid submission. Visual inspection suggests agents treated the task as simple style transfer without accounting for the specific speckle noise statistics of ultrasound physics. Conversely, on LDCT-IQA (CT Image Quality Assessment), R\&D-Agent achieved a score of 2.66 (50th percentile), coming close to the winning score of 2.74. However, this strong performance may be influenced by the metric's sensitivity to global image statistics rather than a genuine understanding of diagnostic image quality, a pattern examined further in our failure taxonomy. Notably, LDCT-IQA is the only challenge that does not require submitting separate prediction files beyond a \texttt{submission.csv}, suggesting that these agents perform best when the submission process is substantially simplified but not realistic.

\subsection{Capability Analysis}



\begin{figure}[t]
    \centering
    \includegraphics[width=0.9\columnwidth]{figures/figure6.png}
    \caption{Comparison of ML research agent capabilities across 13 key success factors.}
    \label{fig:capabilities}
\end{figure}


To investigate the underlying causes of the performance gap, we moved beyond outcome-level metrics and examined the process validity of each agent’s workflow by evaluating every execution trace excluding code against the 13 Winning Strategies identified by \citet{eisenmann2023winner} in their meta-analysis of biomedical competition winners. Given the scale of our benchmark, which includes 60 execution traces across 20 challenges, manual annotation was impractical, so we developed an automated adjudication pipeline powered by a state-of-the-art large language model (GPT-5) acting as a technical evaluator. This system parses full execution logs, including shell commands, Python code, and internal reasoning traces, and assigns each strategy a binary score: 1 when explicit evidence is present (for example, importing nibabel for resampling or applying test-time augmentation) and 0 when the evidence is absent or ambiguous.
To validate this automated approach, we conducted human expert annotation on 50 randomly sampled traces. Two domain experts independently annotated whether each strategy was implemented, yielding 100\% agreement with GPT-5's judgments. This high reliability stems from our deliberate reformulation of capability assessment as binary evidence detection under a strict, predefined rubric, rather than open-ended qualitative judgment.
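
\noindent One way to formalize the resulting coverage scores is sketched below. It is an assumption for illustration and not necessarily the exact aggregation used; the symbols are introduced here only for that purpose. Let $e_{a,k,t} \in \{0, 1\}$ denote the binary judgment for agent $a$, strategy $k$, and execution trace $t$, and let $T_a$ be the set of evaluated traces for agent $a$. The coverage of strategy $k$ by agent $a$ is then
\[
s_{a,k} = \frac{1}{|T_a|} \sum_{t \in T_a} e_{a,k,t},
\]
which ranges from 0 (the strategy is never demonstrated) to 1 (the strategy is demonstrated in every trace).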

The capability scores in Figure \ref{fig:capabilities} reveal substantial differences in how agents leverage winning strategies. R\&D-Agent demonstrates notably higher coverage across multiple strategies, particularly excelling in having domain knowledge (0.663), reflecting metrics in method design (0.638), postprocessing results (0.614), and analyzing and handling failure cases (0.601). In contrast, AIDE and ML-Master show comparable but considerably lower coverage overall. ML-Master exhibits modest strengths in analyzing and handling failure cases (0.322) and knowing the state of the art (0.021), while AIDE shows its highest engagement in postprocessing results (0.339) and reflecting metrics in method design (0.195). Despite these differences, all three agents converge on notably poor performance in several critical areas: optimizing the augmentation method, ensembling, and leveraging external data, where scores approach or reach zero. Overall, the results show that while R\&D-Agent engages with winning strategies substantially more frequently than AIDE and ML-Master, even this higher coverage falls short of expert-level rigor, and as demonstrated by our previous results, capability presence does not guarantee correct execution or improved outcomes.

\begin{table*}[t]
\centering
\footnotesize	
\caption{3-Day agent performance across representative ReX-MLE challenges with primary metric values and percentile ranks (Competition Rank) separated.}
\label{tab:3_day_performance_split}
\begin{tabular}{l l 
                c c 
                c c 
                c c 
                c}
\toprule
\textbf{Challenge} & \textbf{Metric ($\uparrow$)} 
& \multicolumn{2}{c}{\textbf{AIDE}} 
& \multicolumn{2}{c}{\textbf{ML-Master}} 
& \multicolumn{2}{c}{\textbf{R\&D-Agent}} 
& \textbf{Human} \\
\cmidrule(lr){3-4}\cmidrule(lr){5-6}\cmidrule(lr){7-8}
& & Value & Rank & Value & Rank & Value & Rank & Value \\
\midrule
ISLES'22          & Dice     & 0.18 & 0\%  & 0.36 & 0\%  & 0.04 & 25\% & 0.79 \\
DENTEX            & AP       & 0.10 & 0\%  & 0.02 & 0\%  & 0.03 & 0\%  & 0.40 \\
TopCoW-CTA-Cls    & Accuracy & 0.31 & 25\% & 0.26 & 25\% & 0.26 & 25\% & 0.73 \\
LDCT-IQA          & Score    & 2.24 & 0\%  & 2.64 & 33\%  & 2.70 & 83\% & 2.74 \\
\midrule
Overall Mean Percentile & -- & -- & 6.25\% & -- & 14.5\% & -- & 33.25\% & -- \\
\bottomrule
\end{tabular}
\end{table*}

\subsection{Effect of Time Budget and Model Backend}
To assess how agent performance scales with compute and model choice, we ran two ablations: (1) extending the time budget from 24 to 72 hours, and (2) evaluating agents across three LLMs (GPT-5, Gemini 3, Claude Sonnet 4.5). Both ablations use four representative challenges spanning segmentation, detection, classification, and image quality assessment.

\paragraph{Three-Day Experiments.}
Table~\ref{tab:3_day_performance_split} shows that extending the time budget to three full days yields minimal improvement for most agents and tasks. On ISLES'22 and DENTEX, all agents remain in the 0th percentile, except for R\&D-Agent on ISLES'22, which reaches the 25th percentile. On TopCoW-CTA-Cls, all three agents land in the 25th percentile, which is an improvement only for ML-Master (from 0\% under the 24-hour budget) and a decline for AIDE (from 33\%) and R\&D-Agent (from 50\%); all remain far below expert solutions. The highest rank is observed on LDCT-IQA, where R\&D-Agent reaches the 83rd percentile (up from the 50th), suggesting that simpler, non-volumetric tasks benefit more from additional time than complex 3D tasks. Overall mean percentile ranks are 6.25\% for AIDE, 14.5\% for ML-Master, and 33.25\% for R\&D-Agent, indicating that increased time alone does not resolve core failure modes.
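
\noindent The overall means follow from an unweighted average of the four per-challenge percentile ranks in Table~\ref{tab:3_day_performance_split}, listed in the order ISLES'22, DENTEX, TopCoW-CTA-Cls, LDCT-IQA:
\[
\bar{p}^{\,72\mathrm{h}}_{\mathrm{AIDE}} = \frac{0+0+25+0}{4} = 6.25\%, \quad
\bar{p}^{\,72\mathrm{h}}_{\mathrm{ML\mbox{-}Master}} = \frac{0+0+25+33}{4} = 14.5\%, \quad
\bar{p}^{\,72\mathrm{h}}_{\mathrm{R\&D}} = \frac{25+0+25+83}{4} = 33.25\%.
\]
Applying the same average to the 24-hour ranks of these four challenges in Table~\ref{tab:agent_performance_full_split} gives the corresponding reference points:
\[
\bar{p}^{\,24\mathrm{h}}_{\mathrm{AIDE}} = \frac{0+0+33+33}{4} = 16.5\%, \quad
\bar{p}^{\,24\mathrm{h}}_{\mathrm{ML\mbox{-}Master}} = \frac{0+0+0+0}{4} = 0\%, \quad
\bar{p}^{\,24\mathrm{h}}_{\mathrm{R\&D}} = \frac{0+0+50+50}{4} = 25\%.
\]
On this subset, the longer budget thus lowers the mean rank of AIDE and raises that of ML-Master and R\&D-Agent.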



\paragraph{Model Backend Ablation.}
Table~\ref{tab:agent_performance_models_split} evaluates AIDE and ML-Master using GPT-5, Gemini 3, and Claude Sonnet 4.5 as their backend LLMs. We exclude R\&D-Agent from these results as it has no native backend support for Claude and Gemini. While LLM choice influences absolute scores, the overall pattern of failure is unchanged. In segmentation (ISLES'22), ML-Master with Claude produces the highest Dice score (0.65), yet still ranks in the 25th percentile, indicating that a different backend cannot compensate for missing domain-specific engineering. In detection (DENTEX), Gemini enables AIDE to reach the 8th percentile, but its AP of 0.20 remains far below the winning AP of 0.40. Classification results vary little across backends, and no configuration exceeds the 33rd percentile. LDCT-IQA shows the largest backend effect: with Gemini, AIDE reaches the 83rd percentile (2.70 against a winning score of 2.74), whereas with Claude, AIDE scores 1.12 (0th percentile) and ML-Master fails to produce a valid submission. Overall, backend effects are second-order relative to structural agent limitations: agents cannot reliably build, train, or validate domain-appropriate medical imaging pipelines, regardless of the foundation model driving their reasoning.


\begin{table*}[t]
\centering
\footnotesize	
\caption{Agent performance with different backend LLMs across representative ReX-MLE challenges, with primary metric values and percentile ranks (Competition Rank) separated. The specific models used are GPT-5, Gemini 3, and Claude Sonnet 4.5.}
\setlength{\tabcolsep}{10pt}
\label{tab:agent_performance_models_split}
\begin{tabular}{l l l 
                c c 
                c c 
                c}
\toprule
\textbf{Challenge} & \textbf{Model} & \textbf{Metric ($\uparrow$)} 
& \multicolumn{2}{c}{\textbf{AIDE}} 
& \multicolumn{2}{c}{\textbf{ML-Master}} 
& \textbf{Human} \\
\cmidrule(lr){4-5}\cmidrule(lr){6-7}
& & & Value & Rank & Value & Rank & Value \\
\midrule

\multicolumn{8}{l}{\textit{\textbf{Segmentation Tasks}}} \\
ISLES'22 & GPT  & Dice & 0.04 & 0\%  & 0.00 & 0\%  & 0.79 \\
ISLES'22 & Gemini & Dice & \textcolor{red}{FAIL} & --   & 0.46 & 25\% & 0.79 \\
ISLES'22 & Claude & Dice & 0.45 & 5\%  & 0.65 & 25\%  & 0.79 \\
\midrule

\multicolumn{8}{l}{\textit{\textbf{Detection Tasks}}} \\
DENTEX & GPT  & AP & 0.09 & 0\%  & 0.08 & 0\%   & 0.40 \\
DENTEX & Gemini & AP & 0.20 & 8\%  & 0.19 & 1\%  & 0.40 \\
DENTEX & Claude & AP & 0.02 & 0\%  & 0.12 & 0\%  & 0.40 \\
\midrule

\multicolumn{8}{l}{\textit{\textbf{Classification Tasks}}} \\
TopCoW-CTA-Cls & GPT  & Accuracy & 0.33 & 33\% & 0.10 & 0\%  & 0.73 \\
TopCoW-CTA-Cls & Gemini & Accuracy & 0.35 & 25\% & 0.33 & 25\% & 0.73 \\
TopCoW-CTA-Cls & Claude & Accuracy & 0.33 & 25\% & 0.33 & 25\% & 0.73 \\
\midrule

\multicolumn{8}{l}{\textit{\textbf{Image Quality $\&$ Enhancement Tasks}}} \\
LDCT-IQA & GPT  & Score & 2.62 & 33\% & 2.50 & 0\%  & 2.74 \\
LDCT-IQA & Gemini & Score & 2.70 & 83\% & 2.62 & 17\% & 2.74 \\
LDCT-IQA & Claude & Score & 1.12 & 0\%  & \textcolor{red}{FAIL} & -- & 2.74 \\
\bottomrule
\end{tabular}
\end{table*}

