BioAgent Bench: An AI Agent Evaluation Suite for Bioinformatics
Keywords: agents, bioinformatics, benchmark, evaluation suite, dataset
TL;DR: BioAgent Bench is a benchmark dataset and evaluation harness for stress-testing LLM agents on realistic end-to-end bioinformatics workflows that require tool use, file handling, and structured artifact generation.
Abstract: We introduce BioAgent Bench, an evaluation suite for measuring the performance and robustness of AI agents on common bioinformatics tasks. The suite consists of manually curated end-to-end tasks (e.g., RNA-seq, variant calling, metagenomics), each accompanied by a task-specific prompt and concrete output artifacts that support automated assessment. We evaluate frontier closed- and open-weight models across multiple agent harnesses, and use an LLM-based grader to score pipeline progress and outcome validity. We find that agents based on frontier LLMs can complete multi-step bioinformatics pipelines without elaborate custom scaffolding, often producing the requested final artifacts reliably. However, robustness tests reveal failure modes under controlled perturbations (corrupted inputs, decoy files, and prompt bloat), indicating that correct high-level pipeline construction does not guarantee reliable step-level reasoning. Finally, bioinformatics workflows often involve sensitive patient data or unpublished intellectual property, which makes cost-effective yet reliable local agents imperative. By releasing the code and the complementary resources that make up our suite, we aim to accelerate the development of such privacy-preserving agents.
Track: Regular Paper (9 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 82