Towards Autonomous Experimentation: BioProBench, a Corpus and Benchmark for Biological Protocol Comprehension

08 Sept 2025 (modified: 12 Feb 2026) · ICLR 2026 Conference Desk Rejected Submission · CC BY 4.0
Keywords: Biological Protocol, Dataset and Benchmark, LLMs
Abstract: The automation of scientific experimentation is critically hindered by the inability of Large Language Models (LLMs) to reliably comprehend the specialized, accuracy-critical, and procedural nature of biological protocols. To address this fundamental challenge, we present **BioProBench**, a comprehensive resource for procedural reasoning in biology. BioProBench is grounded in a foundational corpus of 27,000 human-written protocols. From this corpus, we systematically constructed a dataset of over 550,000 task instances, partitioning it into a large-scale training set and a rigorous benchmark with a held-out test set and novel evaluation metrics. Our comprehensive evaluation of 10 mainstream LLMs on the benchmark reveals a critical performance gap: while models excel on basic comprehension tasks, they underperform on tasks requiring deep procedural logic, quantitative accuracy, and safety-critical reasoning. To demonstrate the value of our corpus in mitigating these issues, we developed **ProAgent**, a Retrieval-Augmented Generation (RAG) agent. Grounded in our corpus, ProAgent substantially advances the state-of-the-art. BioProBench thus provides both a rigorous diagnostic benchmark and a foundational resource for developing the next generation of reliable AI for science. The code and data are available at: https://anonymous.4open.science/r/Anonymization-112358/README.md.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 3119