Keywords: Biological Data Engineering, Benchmark, Agent
Abstract: Automating the intricate process of dataset construction, known as Biological Data Engineering (BDE), is a grand challenge for autonomous AI agents and a critical bottleneck in scientific discovery. While Large Language Models (LLMs) show promise, their application is hampered by the absence of a rigorous benchmark to guide and evaluate agent development in this domain. To address this gap, we introduce \benchmark, the first comprehensive benchmark designed to operationalize BDE and drive progress in scientific automation. \benchmark features 114 realistic tasks curated from 150 peer-reviewed biological publications. It systematically tackles core scientific challenges by: (1) managing procedural ambiguity with clear goals but open-ended execution paths; (2) establishing intermediate ground truth by manually replicating each task with tractable data; and (3) enabling complex, multi-modal evaluation through custom, domain-aware evaluators for specialized scientific data formats beyond simple string matching. We conduct an extensive evaluation of state-of-the-art agents powered by models such as GPT-4.1, Claude 4, and Gemini 2.5. Our results reveal that while these models exhibit nascent capabilities, their overall success rates are modest, exposing a significant performance gap. We identify critical and recurrent failure modes, including struggles with multi-step tool chaining, hallucination of tool parameters, inability to parse scientific file formats, and a lack of long-horizon reasoning. These findings not only validate the challenging nature of BDE but also provide a granular, empirical roadmap for the community to develop more robust and reliable scientific agents.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Submission Number: 12323