Minibinder Lab: The Reliability Gap Of Agents For Designing High Quality Protein Binders

Published: 23 May 2026, Last Modified: 23 May 2026, Venue: ICML 2026 AIWILD, License: CC BY 4.0
Keywords: Machine Learning, Agents, Binders, Proteins, LLMs, Reasoning
TL;DR: We introduce Minibinder Lab, an agentic framework for iterative design and optimisation of minibinders, and, in collaboration with domain experts, systematically analyse its performance and critical safety-relevant failure modes.
Abstract: We introduce Minibinder Lab, an agentic framework for iterative design and optimisation of minibinders, small protein structures emerging as a key modality in therapeutic drug discovery. While recent generative models enable rapid in silico candidate generation, selecting and improving binders remains expert-driven, requiring target-specific reasoning, structural inspection, interface scoring, and sequence novelty assessment. Minibinder Lab systematises this process through 12 specialised agents that coordinate generation, analysis, critique, and refinement across design rounds, combining large language model reasoning with established structural and sequence-level tools alongside target-specific literature evidence. We evaluate the framework on three therapeutically relevant targets, PD-L1, PDGFR, and CD3, using two open- and two closed-source large language models, showing that agents reason over multi-source protein design data and propose redesign strategies judged reasonable by two domain experts. However, systematic analysis of agent behaviour exposes a critical safety-relevant failure mode: while reasoning models successfully integrate literature evidence into design decisions, they cannot self-correct after selecting incorrect tool configurations, persisting with erroneous choices despite contradictory downstream outputs. In a high-stakes scientific domain where unchecked agent errors could propagate flawed candidates into experimental pipelines, the agents' inability to recognise and recover from their own mistakes represents a concrete reliability boundary that must be addressed before such systems can be deployed autonomously in the wild.
Track: Regular Paper (9 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 199