Keywords: AI agents, agent evaluation, spatial transcriptomics, trace inspection, benchmarks, computational biology
TL;DR: We created 40 spatial transcriptomics tasks and found that giving coding agents more help often made them perform worse, because they leaned too much on the extra context instead of checking what the data actually showed.
Abstract: Building and evaluating scientific coding agents requires not only executable tasks but also human-in-the-loop verification for ambiguous and non-verifiable outcomes. Most benchmarks to date have been custom-built by experts and scored via rubrics on final outputs. Here we explore two interactive augmentations: (1) building benchmark datasets from peer-reviewed papers, and (2) inspecting agent traces to see how workflows emerge. We construct 40 spatial transcriptomics alignment tasks, a challenging problem in computational biology in which agents submit coordinate tables aligning a pair of 2D slices. Across 120 runs under each of three configurations (basic prompt, package-aware prompt, and full prompt plus prebuilt virtual environment), we find that more package hints increased tool exploration but did not improve performance: the full configuration scored lower than the basic one (0.361 vs. 0.428; 95% CI of the difference [-0.113, -0.028]). Trace inspection shows that richer scaffolding encouraged unnecessary transformations on already-aligned inputs, fragile package-first workflows, and infrastructure failures. Our results suggest that scientific-agent evaluations should involve interactive construction and trace inspection to balance the variance and bias introduced by added context and tools.
Submission Number: 57