\section{Failure Analysis}
\label{sec:failure_analysis}

The 0\% medal rate across all agents and challenges demands systematic investigation. We develop a 20-point evaluation framework examining six capability dimensions, enabling quantitative comparison of failure patterns.

\subsection{Analysis Methodology}

For each agent-challenge combination, we manually analyze complete execution traces, generated code, error messages, and attempted approaches. We score each dimension on established criteria based on comparison with winning solutions~\citep{eisenmann2023winner}. Each dimension comprises 3--4 specific criteria scored 0--2 points (0: capability absent, 1: partially present, 2: adequately demonstrated).

Our six dimensions span the full ML development pipeline:
\begin{itemize}[leftmargin=*, nosep]
    \item \textbf{Domain Knowledge} (4 pts): Medical terminology, anatomy, clinical context, modality characteristics
    \item \textbf{Data Handling} (4 pts): 3D processing, memory management, format handling, preprocessing pipelines
    \item \textbf{Architecture Selection} (3 pts): Domain-appropriate models, specialized architectures, adaptations
    \item \textbf{Evaluation Metrics} (3 pts): Metric implementation, multi-class handling, optimization alignment
    \item \textbf{Search Strategy} (3 pts): Exploration depth, technique coverage, iteration efficiency
    \item \textbf{Implementation} (3 pts): Code quality, resource management, debugging capability
\end{itemize}

\subsection{Dimension 1: Domain Knowledge (4 points)}

\textbf{Average scores: AIDE 0.6/4, ML-Master 0.8/4, M3Builder 1.2/4, InternAgent 0.7/4}

Agents fundamentally lack medical imaging domain knowledge. They fail to understand that vessel segmentation requires different techniques than organ segmentation, or that pathology images demand multi-resolution processing.

\textit{Example (TopCoW vessel segmentation):} All agents except M3Builder attempted 2D classification models despite task descriptions explicitly mentioning ``3D time-of-flight MRA volumes'' and ``Circle of Willis vessels.'' Winning solutions used 3D nnU-Net with vessel-specific preprocessing (intensity normalization, small object removal). Agents ignored these domain standards.

\textit{Example (PUMA pathology):} Agents failed to recognize that whole-slide images exceed GPU memory limits, requiring patch-based processing. Winners used multi-scale pyramidal approaches with context windows. Agents attempted to load entire images, causing memory crashes or severe downsampling that lost cellular detail.

M3Builder's modest advantage (1.2/4) comes from recognizing U-Net as the default for segmentation, but it still lacks deeper knowledge, such as when to use 3D vs.\ 2D variants, which loss functions suit class imbalance, and which augmentation strategies fit each modality.

\subsection{Dimension 2: Data Handling (4 points)}

\textbf{Average scores: AIDE 0.9/4, ML-Master 1.1/4, M3Builder 1.6/4, InternAgent 1.0/4}

3D medical imaging processing proves particularly challenging. Agents struggle with basic requirements that winning solutions handle routinely.

\textit{Memory management:} Agents frequently run out of memory loading 3D volumes. Winners use patch-based training (extracting $64\times64\times64$ or $128\times128\times128$ crops), gradient checkpointing, and mixed precision. Agents either load full volumes, causing crashes, or downsample excessively ($512\times512\times300 \rightarrow 128\times128\times75$), losing critical detail.
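
As a minimal sketch of the patch-based sampling described above, the following NumPy snippet draws a random fixed-size crop from a volume and zero-pads any volume smaller than the patch. The function name \texttt{random\_patch}, the toy volume shape, and the padding policy are illustrative assumptions and are not taken from any winning solution.

```python
import numpy as np


def random_patch(volume, patch_size, rng):
    """Return a random crop of shape `patch_size` from a 3D volume.

    Volumes smaller than the patch along any axis are zero-padded first.
    """
    # Pad at the end of each axis so the patch always fits.
    pad = [(0, max(0, p - s)) for s, p in zip(volume.shape, patch_size)]
    volume = np.pad(volume, pad, mode="constant")
    # Choose a random corner such that the crop stays inside the volume.
    start = [int(rng.integers(0, s - p + 1))
             for s, p in zip(volume.shape, patch_size)]
    crop = tuple(slice(a, a + p) for a, p in zip(start, patch_size))
    return volume[crop]


rng = np.random.default_rng(0)
# Toy volume (kept small here); a real scan might be 300 x 512 x 512.
volume = np.zeros((96, 160, 160), dtype=np.float32)
patch = random_patch(volume, (64, 64, 64), rng)

# Fraction of a 512 x 512 x 300 scan covered by one 64^3 patch.
patch_fraction = 64 ** 3 / (512 * 512 * 300)
```

A single $64\times64\times64$ crop covers about 0.33\% of the voxels in a $512\times512\times300$ scan, which is why patch-based training keeps per-step memory bounded without discarding resolution.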

\textit{Format understanding:} Medical formats (DICOM, NIfTI) contain metadata crucial for preprocessing: voxel spacing, orientation, and intensity ranges. Winners normalize based on scanner-specific Hounsfield units (CT) or rescale using percentile clipping (MRI). Agents treat these as generic images, applying ImageNet normalization (mean=[0.485, 0.456, 0.406]) to grayscale medical volumes.
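
The two normalization styles mentioned above can be sketched as follows. The window bounds and percentiles are placeholder assumptions chosen for illustration; in practice they depend on the anatomy and the dataset. The helper names \texttt{window\_ct} and \texttt{clip\_percentiles\_mri} are hypothetical.

```python
import numpy as np


def window_ct(hu, lower=-1000.0, upper=400.0):
    """Clip a CT volume to a Hounsfield-unit window and rescale to [0, 1].

    The default bounds are illustrative, not a recommendation.
    """
    hu = np.clip(hu.astype(np.float32), lower, upper)
    return (hu - lower) / (upper - lower)


def clip_percentiles_mri(volume, low_pct=0.5, high_pct=99.5):
    """Clip an MRI volume to its own percentiles and rescale to [0, 1]."""
    lo, hi = (float(v) for v in np.percentile(volume, [low_pct, high_pct]))
    volume = np.clip(volume.astype(np.float32), lo, hi)
    return (volume - lo) / max(hi - lo, 1e-8)


ct = window_ct(np.array([-2000.0, -1000.0, 400.0, 3000.0]))
mri = clip_percentiles_mri(np.arange(1000.0))
```

Neither function uses fixed RGB channel statistics: the CT path relies on the physical meaning of the intensity scale, and the MRI path relies on each scan's own intensity distribution.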

\textit{Preprocessing pipelines:} Winners apply domain-standard preprocessing: resampling to isotropic spacing, intensity windowing, Z-score normalization per scan. Agents skip these steps or apply them incorrectly (e.g., normalizing entire dataset jointly instead of per-volume, breaking test-time deployment).
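
A minimal sketch of per-volume Z-score normalization is given below, assuming each scan is available as a NumPy array. The function name \texttt{zscore\_per\_volume} and the synthetic scans are illustrative assumptions.

```python
import numpy as np


def zscore_per_volume(volume, eps=1e-8):
    """Normalize one scan using only its own mean and standard deviation."""
    volume = volume.astype(np.float32)
    return (volume - volume.mean()) / (volume.std() + eps)


rng = np.random.default_rng(0)
# Two synthetic scans with very different intensity scales.
scan_a = rng.normal(loc=100.0, scale=20.0, size=(16, 16, 16))
scan_b = rng.normal(loc=900.0, scale=150.0, size=(16, 16, 16))
norm_a = zscore_per_volume(scan_a)
norm_b = zscore_per_volume(scan_b)
```

Because the statistics come from the scan itself, the same function applies unchanged to an unseen test volume, whereas dataset-level statistics would have to be stored and may not match a new scanner or protocol.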

M3Builder scores higher (1.6/4) by handling NIfTI format and implementing basic preprocessing, but still fails on memory management for large volumes.

\subsection{Dimension 3: Architecture Selection (3 points)}

\textbf{Average scores: AIDE 0.7/3, ML-Master 0.9/3, M3Builder 1.3/3, InternAgent 0.8/3}

Agents consistently choose inappropriate architectures, defaulting to standard computer vision models instead of medical imaging standards.

\textit{Segmentation tasks:} Winners universally used U-Net variants (nnU-Net, 3D U-Net, residual U-Net). AIDE and InternAgent primarily tried ResNet50 + decoder, VGG + FCN, or DeepLabV3 from torchvision. These lack the multi-scale processing and skip connections essential for precise medical segmentation. ML-Master explored more options but still favored standard architectures over domain-specific ones.

\textit{Detection tasks:} For 3D aneurysm detection (TopCoW), winners adapted Faster R-CNN to 3D or used specialized vessel detection networks. Agents used 2D Faster R-CNN on individual slices, losing volumetric context critical for distinguishing true aneurysms from normal vessel bifurcations.

\textit{Multi-task scenarios:} DENTEX requires simultaneous detection, enumeration (counting teeth), and diagnosis (classifying diseases). Winners used multi-head architectures sharing a backbone. Agents treated this as pure object detection, ignoring the classification component entirely.

M3Builder's advantage (1.3/3) stems from preferring U-Net for segmentation. However, it fails to select appropriate variants (2D vs 3D) or configure them optimally (patch size, network depth).

\subsection{Dimension 4: Evaluation Metric Understanding (3 points)}

\textbf{Average scores: AIDE 0.8/3, ML-Master 1.0/3, M3Builder 1.4/3, InternAgent 0.9/3}

Medical metrics differ fundamentally from ImageNet accuracy or COCO mAP. Agents misunderstand these metrics, leading to optimization misalignment.

\textit{Dice vs IoU:} Agents confuse Dice coefficient with IoU, or implement Dice loss incorrectly (missing smoothing terms that prevent division by zero for empty predictions). One agent optimized cross-entropy loss while evaluation used Dice, creating a train-test objective mismatch.
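
To make the distinction concrete, the sketch below computes both Dice and IoU with a smoothing term for flat binary masks. It is a plain-Python illustration with a hypothetical function name, not code from any evaluated agent.

```python
def dice_and_iou(pred, target, smooth=1e-6):
    """Dice and IoU for flat binary masks given as sequences of 0/1.

    The smoothing term keeps both scores defined when both masks are empty.
    """
    intersection = sum(p * t for p, t in zip(pred, target))
    pred_sum, target_sum = sum(pred), sum(target)
    union = pred_sum + target_sum - intersection
    dice = (2 * intersection + smooth) / (pred_sum + target_sum + smooth)
    iou = (intersection + smooth) / (union + smooth)
    return dice, iou


pred = [1, 1, 1, 0, 0, 0]
target = [0, 1, 1, 1, 0, 0]
dice, iou = dice_and_iou(pred, target)  # Dice is about 0.667, IoU about 0.5

# Both masks empty: without smoothing this would divide by zero.
empty_dice, empty_iou = dice_and_iou([0, 0, 0], [0, 0, 0])
```

The two scores are related by $\mathrm{Dice} = 2\,\mathrm{IoU}/(1+\mathrm{IoU})$, so they rank a single prediction identically but take different numerical values and should not be reported interchangeably.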

\textit{Panoptic Quality (PQ):} PUMA uses $\mathrm{PQ} = \bigl(\mathrm{TP}/(\mathrm{TP} + 0.5\,\mathrm{FP} + 0.5\,\mathrm{FN})\bigr) \times (\mbox{Dice across matched objects})$. This combines detection accuracy and segmentation quality. Agents treated this as pure segmentation, optimizing Dice without considering instance separation. Winners used instance segmentation (Mask R-CNN variants) with watershed post-processing.
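
The following sketch evaluates the PQ expression above from already-matched objects. It assumes that the matching step has been done elsewhere and that the per-match overlap scores are passed in directly, so Dice (as in the expression above) or IoU (as in the commonly used panoptic segmentation definition) can be supplied. The function name and the example numbers are hypothetical.

```python
def panoptic_quality(match_scores, num_fp, num_fn):
    """PQ = detection term * mean overlap score over matched objects.

    match_scores: one overlap score per true-positive match.
    num_fp: predicted objects without a match.
    num_fn: reference objects without a match.
    """
    tp = len(match_scores)
    denom = tp + 0.5 * num_fp + 0.5 * num_fn
    if denom == 0 or tp == 0:
        return 0.0
    detection = tp / denom
    segmentation = sum(match_scores) / tp
    return detection * segmentation


# Two well-separated matches, one spurious and one missed object.
pq_separated = panoptic_quality([0.9, 0.8], num_fp=1, num_fn=1)

# One merged blob covering three nuclei: at most one match, two misses.
pq_merged = panoptic_quality([0.5], num_fp=0, num_fn=2)
```

The second case shows why optimizing pixel-level overlap alone is insufficient: a merged prediction can overlap the foreground well and still be penalized heavily through the detection term.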

\textit{Multi-class metrics:} TopCoW evaluates 13 vessel classes separately, and the final score averages per-class Dice. Agents optimized micro-averaged (voxel-pooled) Dice, which underweights small vessels that are individually critical. Winners used weighted losses emphasizing difficult small classes.
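
The effect of the averaging scheme can be illustrated with a two-class toy example. The voxel counts below are invented for illustration only; the sketch contrasts a per-class (macro) average with a voxel-pooled (micro) Dice.

```python
def dice_from_counts(tp, fp, fn):
    """Dice score from true-positive, false-positive, false-negative counts."""
    return 2 * tp / (2 * tp + fp + fn)


# Hypothetical voxel counts: (tp, fp, fn) per class.
counts = {
    "large_vessel": (9000, 500, 500),
    "small_vessel": (10, 45, 45),
}

per_class = {name: dice_from_counts(*c) for name, c in counts.items()}

# Macro: average the per-class scores, so every class counts equally.
macro_dice = sum(per_class.values()) / len(per_class)

# Micro: pool the counts first, so large classes dominate.
total_tp = sum(c[0] for c in counts.values())
total_fp = sum(c[1] for c in counts.values())
total_fn = sum(c[2] for c in counts.values())
micro_dice = dice_from_counts(total_tp, total_fp, total_fn)
```

In this example the pooled score is about 0.94 while the per-class average is about 0.56, so a model can look strong under pooling while almost entirely missing the small vessel.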

\textit{Hausdorff distance:} The Hausdorff distance measures worst-case boundary error, which is critical for radiation therapy planning, where even small errors matter. No agent implemented this metric; some confused it with average surface distance.
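
The difference between the two boundary metrics can be sketched on small point sets. The brute-force implementation below is an illustrative assumption meant for clarity; it scales quadratically and is not suitable for full-resolution surfaces.

```python
import math


def directed_distances(a, b):
    """For each point in a, the distance to its nearest point in b."""
    return [min(math.dist(p, q) for q in b) for p in a]


def hausdorff_and_asd(a, b):
    """Symmetric Hausdorff distance and average surface distance."""
    d_ab = directed_distances(a, b)
    d_ba = directed_distances(b, a)
    hausdorff = max(max(d_ab), max(d_ba))
    asd = (sum(d_ab) + sum(d_ba)) / (len(d_ab) + len(d_ba))
    return hausdorff, asd


# Reference boundary and a prediction with a single outlier point.
reference = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)]
predicted = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 4, 0)]
hd, asd = hausdorff_and_asd(reference, predicted)
```

Here a single stray point yields a Hausdorff distance of 4 while the average surface distance stays at 0.625, which shows why the two metrics cannot be substituted for one another.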

M3Builder implements standard medical metrics correctly (Dice, mAP) but still misses nuances like class weighting and composite metrics.

\subsection{Dimension 5: Search Strategy (3 points)}

\textbf{Average scores: AIDE 1.1/3, ML-Master 1.5/3, M3Builder 1.4/3, InternAgent 1.2/3}

Despite sophisticated search algorithms, agents fail to explore domain-relevant solution spaces.

\textit{Premature convergence:} AIDE's greedy search tries 3--5 approaches before settling on one. Winners report testing 15--20 configurations. ML-Master explores more breadth but allocates iterations poorly, spending equal effort on clearly unsuitable approaches (e.g., 1D CNNs for 3D segmentation) and promising ones.

\textit{Missing domain techniques:} Agents rarely attempt techniques standard in winning solutions: ensemble methods (used by 50\% of winners), multi-stage pipelines (61\%), test-time augmentation (40\%). These techniques are well-documented in competition forums and papers, yet agents neither discover nor attempt them.

\textit{Inefficient iteration:} Agents waste time re-implementing basic components rather than leveraging libraries. Winners use production medical imaging frameworks such as nnU-Net, MONAI, or MedicalZoo. Agents implement U-Net from scratch, introducing bugs and missing optimizations.

ML-Master's MCTS provides some advantage (1.5/3) through broader exploration, but still fails to prioritize domain-standard approaches over random variations.

\subsection{Dimension 6: Implementation Quality (3 points)}

\textbf{Average scores: AIDE 0.8/3, ML-Master 1.1/3, M3Builder 1.3/3, InternAgent 0.9/3}

Even when agents select reasonable approaches, implementation failures prevent success.

\textit{Library compatibility:} Agents mix incompatible library versions (SimpleITK 2.1 with numpy 1.24, causing dimension ordering bugs). Winners use tested environments documented in prior competitions.

\textit{GPU memory crashes:} Agents fail to implement fallback strategies when memory is insufficient. Winners progressively reduce batch size, patch size, or network depth. Agents crash and retry the same configuration.
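
A fallback loop of the kind described above can be sketched as follows. The training function, the memory budget, and the halving schedule are all hypothetical; \texttt{MemoryError} stands in for whatever out-of-memory exception the deep learning framework raises.

```python
def train_with_fallback(train_fn, batch_size=4, patch_size=128,
                        min_batch=1, min_patch=32):
    """Retry train_fn with smaller settings after out-of-memory errors.

    Batch size is halved first; once it reaches min_batch, patch size
    is halved. The error is re-raised when nothing can be reduced further.
    """
    while True:
        try:
            result = train_fn(batch_size=batch_size, patch_size=patch_size)
            return result, (batch_size, patch_size)
        except MemoryError:
            if batch_size > min_batch:
                batch_size //= 2
            elif patch_size > min_patch:
                patch_size //= 2
            else:
                raise


def fake_train(batch_size, patch_size, budget=2 * 64 ** 3):
    """Stand-in for a training step with a fixed voxel budget (assumption)."""
    if batch_size * patch_size ** 3 > budget:
        raise MemoryError("simulated out-of-memory")
    return "ok"


result, settings = train_with_fallback(fake_train)
```

With the simulated budget, the loop moves from a batch of four $128^3$ patches down to a single $64^3$ patch before training succeeds, rather than retrying the configuration that crashed.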

\textit{Data augmentation:} 3D augmentation differs from 2D: elastic deformations must preserve anatomical plausibility, and rotations must handle anisotropic voxels. Agents either apply 2D augmentation independently to slices, breaking 3D structure, or skip augmentation entirely, despite 100\% of winners using it.

\textit{Debugging capability:} When validation Dice is 0.0, agents rarely diagnose the issue (often one-hot encoding mismatch or missing sigmoid activation). They retry with different models rather than fixing the bug. Winners systematically debug by visualizing predictions, checking data loading, and verifying metric calculations.

\textit{Invalid submissions:} 4--12 challenges per agent produced invalid submissions (wrong format, missing files, dimension mismatches). These are specification errors: task descriptions provide example submissions and format details. Winners achieve 100\% submission validity; agents fail basic requirements.

\subsection{Quantitative Failure Patterns}

Table~\ref{tab:failure_scores} summarizes aggregate scores across dimensions.

\begin{table}[h]
\centering
\small
\caption{Average failure analysis scores (out of 20 points) across six capability dimensions. No agent exceeds 50\% on any dimension, revealing systematic capability gaps.}
\label{tab:failure_scores}
\begin{tabular}{@{}lccccc@{}}
\toprule
\textbf{Dimension} & \textbf{Max} & \textbf{AIDE} & \textbf{ML-Master} & \textbf{M3Builder} & \textbf{InternAgent} \\
\midrule
Domain Knowledge & 4 & 0.6 & 0.8 & 1.2 & 0.7 \\
Data Handling & 4 & 0.9 & 1.1 & 1.6 & 1.0 \\
Architecture Selection & 3 & 0.7 & 0.9 & 1.3 & 0.8 \\
Evaluation Metrics & 3 & 0.8 & 1.0 & 1.4 & 0.9 \\
Search Strategy & 3 & 1.1 & 1.5 & 1.4 & 1.2 \\
Implementation & 3 & 0.8 & 1.1 & 1.3 & 0.9 \\
\midrule
\textbf{Total} & \textbf{20} & \textbf{4.9} & \textbf{6.4} & \textbf{8.2} & \textbf{5.5} \\
\textbf{Percentage} & -- & \textbf{24.5\%} & \textbf{32.0\%} & \textbf{41.0\%} & \textbf{27.5\%} \\
\bottomrule
\end{tabular}
\end{table}
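
As a quick consistency check, the totals and percentages in Table~\ref{tab:failure_scores} can be re-derived from the per-dimension scores. The snippet below is only an arithmetic sketch; the dictionary layout and variable names are our own.

```python
# Maximum points per dimension, in table order.
max_points = [4, 4, 3, 3, 3, 3]

# Per-dimension average scores copied from the table.
scores = {
    "AIDE":        [0.6, 0.9, 0.7, 0.8, 1.1, 0.8],
    "ML-Master":   [0.8, 1.1, 0.9, 1.0, 1.5, 1.1],
    "M3Builder":   [1.2, 1.6, 1.3, 1.4, 1.4, 1.3],
    "InternAgent": [0.7, 1.0, 0.8, 0.9, 1.2, 0.9],
}

totals = {agent: round(sum(vals), 1) for agent, vals in scores.items()}
percentages = {agent: round(100 * total / sum(max_points), 1)
               for agent, total in totals.items()}

# Highest share of available points reached on any single dimension.
best_dimension_pct = max(
    round(100 * value / cap, 1)
    for vals in scores.values()
    for value, cap in zip(vals, max_points)
)
```

The derived totals match the table, and the highest single-dimension share is 50\% (ML-Master on Search Strategy, 1.5/3).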

\textbf{Key observations:}

\textbf{Uniform failure across dimensions.} No agent exceeds 50\% on any dimension. Even M3Builder, designed for medical imaging, scores only 41\% overall. This contrasts sharply with the 47.7\% medal rate (requiring $\sim$70\% capabilities) achieved by ML-Master on general ML tasks.

\textbf{Limited benefit of specialization.} M3Builder's medical focus provides only a 9.0 to 16.5 percentage point advantage over general agents (41.0\% vs.\ 32.0\%, 27.5\%, and 24.5\%). Its improvements concentrate in predictable areas (architecture selection, data handling) but fail to extend to deeper domain knowledge or superior search strategies. This suggests current specialization approaches are superficial: they provide templates without true domain understanding.

\textbf{Search sophistication insufficient.} ML-Master's advanced MCTS provides minimal advantage (32\% vs.\ 24.5\% for AIDE). On general MLE-bench, this same MCTS drives a nearly $3\times$ improvement (47.7\% vs.\ 16.9\%). The contrast reveals that sophisticated search helps when the solution space contains discoverable patterns, but fails when domain knowledge is required to identify viable approaches.

\textbf{No single dimension dominates.} Failures span all aspects from knowledge to implementation. This rules out simple fixes: agents need comprehensive enhancement, not just better architecture selection or improved data handling.

The quantitative scores validate our qualitative observations: current agents lack the foundational capabilities required for medical imaging automation. Section~\ref{sec:case_studies} examines specific challenges to understand how these capability gaps manifest in practice.
