ABC-Bench: An Agentic Bio-Capabilities Benchmark for Biosecurity

Published: 15 Oct 2025, Last Modified: 24 Nov 2025 · BioSafe GenAI 2025 Oral · CC BY 4.0
Keywords: biosecurity, benchmark, evaluations, llms, agents, biology
TL;DR: We introduce the Agentic Bio-Capabilities Benchmark (ABC-Bench), a suite of evaluations measuring LLM-based agents' performance on biosecurity-relevant tasks, and find that leading LLMs already match or exceed expert human performance on ABC-Bench.
Abstract: LLMs are increasingly useful for research in the life sciences. For some time, LLMs have been able to output detailed and accurate scientific information, but now leading LLM-based tools can also perform certain in silico tasks that were previously the exclusive domain of experienced biologists. These emerging AI capabilities offer new opportunities for scientific discovery and biomedical advances, but they are also changing the landscape of biosecurity risks. It is therefore important to rigorously measure the task-based capabilities of AI models. To address this, we introduce the Agentic Bio-Capabilities Benchmark (ABC-Bench), a suite of evaluations measuring agentic biosecurity-relevant capabilities. Unlike fact-based tests, agentic benchmarks assess whether AI agents can perform complex tasks end-to-end. ABC-Bench evaluates LLM-based agents on both benign and potentially harmful biosecurity-relevant tasks: writing code to operate liquid handling robots, designing DNA fragments for in vitro assembly, and evading DNA synthesis screening. These tasks require a combination of biology and software expertise; indeed, when PhD biologists with at least two years of coding experience attempted the tasks in ABC-Bench, they scored only 24% on average. By contrast, the top-performing LLM, Grok 3, achieves 53% across tasks, outperforming 60%, 100%, and 54% of experts on the Liquid Handling Robot, Fragment Design, and Screening Evasion tasks, respectively. We further tested whether model-generated code could execute in a real laboratory. OpenAI's GPT-4o-mini-high produced code that, when run on an Opentrons robot, successfully assembled DNA with the expected sequences in three independent experiments. These findings demonstrate that LLMs can agentically perform biosecurity-relevant tasks, highlighting an important new dimension of AI usage in biosecurity.
Submission Number: 11