\section{Related Work}

% \begin{table*}[htbp]
% \centering
% % \small
% \caption{\textcolor{red}{Do we still need this table?} Comparison of ML agent benchmarks. \textbf{Real Grade:} Uses real competition leaderboards. \textbf{Output Eval:} Evaluates actual model outputs/submissions. \textbf{Domain Expert:} Requires domain expertise. \textbf{Strategy Analysis:} Understands agent pitfalls and strengths.} 
% \begin{tabular}{lcccc}
% \toprule
% & \multicolumn{4}{c}{\textbf{Evaluation}} \\
% \cmidrule{2-5}
% & Real & Output & Domain & Strategy \\
% & Grade & Eval & Expert & Analysis \\
% \midrule
% MLE-Bench \cite{chan2024mle} & \textcolor{green}{\cmark} & \textcolor{red}{\xmark} & \textcolor{red}{\xmark} & \textcolor{red}{\xmark} \\
% SWE-Bench \cite{jimenez2023swe} & \textcolor{green}{\cmark} & \textcolor{green}{\cmark} & \textcolor{red}{\xmark} & \textcolor{red}{\xmark} \\
% M$^3$Bench \cite{feng2025m} & \textcolor{red}{\xmark} & \textcolor{red}{\xmark} & \textcolor{red}{\xmark} & \textcolor{red}{\xmark} \\
% ReX-MLE & \textcolor{green}{\cmark} & \textcolor{green}{\cmark} & \textcolor{green}{\cmark} & \textcolor{green}{\cmark} \\
% \bottomrule
% \end{tabular}
% \label{tab:benchmark_comparison}
% \end{table*}


\vspace{3pt} \noindent \textbf{Benchmarks for Evaluating AI Agents.}
Robust evaluation frameworks are essential for measuring AI agent capabilities across multiple dimensions. Code-focused benchmarks include HumanEval~\citep{chen2021evaluating} for code generation, CRUXEval~\citep{pmlr-v235-gu24c} for execution reasoning, and SWE-bench~\citep{jimenez2023swe} for resolving real-world GitHub issues. ML workflow benchmarks evaluate agents end to end: MLE-bench~\citep{chan2024mle} curates 75 Kaggle competitions, on which leading agents reach medal rates of up to 50.67\%, while ML-Bench~\citep{tang2023ml}, ML-Dev-Bench~\citep{padigela2025ml}, and TimeSeriesGym~\citep{cai2025timeseriesgym} evaluate repository-level and time series tasks. Domain-specific reasoning benchmarks test specialized knowledge: mathematics evaluations have moved from saturated benchmarks such as GSM8K~\citep{cobbe2021training} and MATH~\citep{hendrycks2021measuring} to harder challenges including Putnam-AXIOM~\citep{gulati2025putnam} and MATH-Perturb~\citep{huang2025math}, while LLM-SRBench~\citep{shojaee2025llm} evaluates scientific equation discovery.
However, existing benchmarks have critical limitations for evaluating domain expertise. None of them jointly provides: (1) grading against real competition leaderboards, (2) evaluation of actual model outputs, (3) tasks that require domain expertise, and (4) analysis of the agent's strategy. Most importantly, they do not systematically analyze \textit{why} agents fail or \textit{what specific capabilities} are missing. Answering these questions is essential for advancing agent development toward genuine domain expertise.

\vspace{3pt} \noindent \textbf{AI Agents for Scientific Research.}
The vision of autonomous AI agents for scientific research has gained significant momentum, with recent systems demonstrating the potential to autonomously explore solution spaces~\citep{toledo2025ai}, conduct end-to-end experiments~\citep{team2025internagent}, and perform complex reasoning tasks. In machine learning engineering, agents such as AIDE~\citep{schmidt2024aide}, ML-Master~\citep{liu2025ml}, and AIRA~\citep{toledo2025ai} employ tree search strategies to tackle Kaggle competitions, while R\&D-Agent~\citep{yang2025rdagentllmagentframeworkautonomous} achieves an open-source state-of-the-art medal rate of 35.1\% on MLE-bench through a modular framework that separates idea generation from implementation. Beyond machine learning, agents have shown promise in automated scientific discovery~\citep{team2025internagent}, formulating hypotheses and designing experiments. However, a critical question remains: while these agents excel at general-purpose tasks, do they possess the specialized domain knowledge required for expert-level scientific work? Our work addresses this question by evaluating agents on medical imaging challenges that demand deep domain expertise, revealing fundamental limitations in current systems' ability to apply specialized knowledge.

\vspace{3pt} \noindent \textbf{Medical AI and Evaluation.}
The medical domain presents unique challenges for AI evaluation because domain knowledge and specialized expertise are critical. Traditional benchmarks such as MedQA~\citep{jin2021disease} have reached saturation, with models exceeding 90\% accuracy~\citep{zuo2025medxpertqa}, and they suffer from limited clinical relevance and a lack of construct validity~\citep{alaa2025medical}. These shortcomings have prompted expert-level benchmarks such as MedXpertQA~\citep{zuo2025medxpertqa}. Medical imaging competitions on platforms like Grand Challenge provide more realistic evaluation, attracting 200+ participants. \citet{eisenmann2023winner} conducted a comprehensive analysis of biomedical competition winners, identifying 20 winning strategies including domain knowledge application, expert collaboration, data curation, and specialized preprocessing. However, a significant gap remains between the capabilities of winning human teams and those of current MLE agents on these challenges. Our work systematically evaluates agents on Grand Challenge competitions under realistic settings to analyze not just \textit{whether} agents fail but \textit{why} they fail and \textit{what specific domain capabilities} they lack. The analysis reveals fundamental limitations in current agents' ability to acquire and apply domain expertise, and these insights are critical for developing the next generation of domain-aware autonomous systems.