Keywords: AI evaluations, AI benchmarks, biosecurity, Bio-AI Models
TL;DR: We introduce ABLE, a biosecurity benchmark assessing an LLM's ability to use bio-AI models to redesign a viral protein.
Abstract: Tool use is an emerging capability of agentic large language models (LLMs), allowing them to interact with external systems across domains. In biology, there has been no systematic investigation of how well LLMs can wield specialized biological AI models (BAIMs) to perform dual-use protein engineering workflows; such an assessment is essential both for realizing the benefits of powerful AI systems and for preventing their misuse. To empirically assess how LLMs interact with BAIMs in biosecurity-relevant contexts, we introduce the Agentic BAIM–LLM Evaluation (ABLE), a benchmark that evaluates an LLM agent's ability to use BAIMs such as ProteinMPNN and AlphaFold3 in a dual-use protein design workflow: redesigning a viral protein to enhance its pathogenic properties while maintaining structural stability. The evaluation suite assesses key capabilities, including protein structure retrieval, design-approach selection, sequence variant generation with ProteinMPNN, and validation through interpretation of AlphaFold3 outputs. We implement ABLE in the Inspect AI framework, providing models with natural-language prompts, controlled tool access, and automated scoring. We evaluate six frontier models on ABLE and find that they differ markedly in both safety behavior and task performance. Three models refused to attempt all tasks, while those that did not refuse varied in their ability to complete them. Our results suggest that current LLMs can lower barriers to protein design by handling information retrieval, tool identification, and, in some cases, direct tool use. However, even leading models remain inconsistent in planning, strategy generation, environment navigation, and incorporating biological information into their tool use. ABLE offers a systematic way to measure these capabilities and their limitations.
Submission Number: 29