When Do LLMs Improve Bayesian Optimization? A Systematic Comparison Across Molecular and Protein Design
Keywords: Bayesian Optimization, Active Learning, LLM, Drug Discovery, Molecular Optimization, Protein Design
Abstract: While powerful, classical Bayesian Optimization (BO) and active learning methods struggle to incorporate complex prior knowledge, provide limited interpretability in explaining why a candidate looks promising, and can be computationally demanding. Large language models (LLMs) offer complementary strengths in reasoning ability and integration of domain knowledge, but it remains unclear **when** and **how** they can reliably improve BO campaigns. We reconcile previous conflicting reports by providing a systematic comparison of various LLM-based approaches (off-the-shelf reasoning LLMs relying on in-context learning, LLMs fine-tuned on synthetic BO data, and lightweight agentic workflows using tools) against classical statistical BO across molecular optimization and protein design tasks. We find that off-the-shelf reasoning LLMs fail in SMILES-based molecular optimization due to their poor handling of SMILES representations and large in-context inputs, but agentic workflows that leverage cheminformatics tools and statistical model-based filtering overcome these limitations. In contrast, in the design of four-residue protein motifs, pure reasoning LLMs excel by generating domain-knowledge-driven hypotheses, while agentic workflows underperform because they rely too heavily on tools. These results highlight the complementarity of reasoning models and agentic architectures, offering guidance on when each is preferable. Finally, we show that non-reasoning LLMs trained via supervised fine-tuning (SFT) can efficiently mimic statistical strategies in our setting, sometimes outperforming reasoning models at a fraction of the computational cost. Together, our findings clarify the respective roles and failure modes of reasoning, agentic, and statistical approaches in BO, and suggest a path toward hybrid methods that combine the strengths of LLM-based hypothesis generation and statistical rigor.
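To make the compared selection patterns concrete, the following is a minimal, self-contained Python sketch under toy assumptions (the candidate pool, objective, surrogate, and function names such as `statistical_proposer` and `filtered_llm_step` are hypothetical and not taken from the paper). It contrasts a single classical statistical acquisition step with an "LLM proposes, statistical model filters" step of the kind the agentic workflows described above use.

```python
# Purely illustrative sketch (not the paper's implementation): it contrasts a
# classical statistical BO step with an LLM-style proposer whose suggestions are
# re-ranked by a statistical filter before the expensive evaluation. The candidate
# pool, toy objective, surrogate, and all function names are hypothetical.
import random

random.seed(0)

# Toy discrete candidate pool standing in for SMILES strings or residue motifs.
POOL = [f"cand_{i}" for i in range(200)]
TRUE_SCORE = {c: random.gauss(0.0, 1.0) for c in POOL}  # hidden objective


def oracle(candidate: str) -> float:
    """Expensive evaluation (e.g., assay or docking) -- toy stand-in."""
    return TRUE_SCORE[candidate]


def statistical_proposer(observed: dict, pool: list) -> str:
    """Classical BO-style step: greedy pick under a crude surrogate
    (mean of observed scores at nearby indices) plus an exploration bonus."""
    def surrogate(c: str) -> float:
        idx = int(c.split("_")[1])
        nearby = [v for o, v in observed.items() if abs(int(o.split("_")[1]) - idx) <= 5]
        return (sum(nearby) / len(nearby)) if nearby else 0.5  # bonus for unexplored regions
    return max((c for c in pool if c not in observed), key=surrogate)


def llm_proposer(observed: dict, pool: list, k: int = 5) -> list:
    """Stand-in for an LLM proposing k candidates from in-context examples:
    here it simply perturbs the best observed candidate (hypothetical behaviour)."""
    best_idx = int(max(observed, key=observed.get).split("_")[1])
    picks = {f"cand_{(best_idx + random.randint(-10, 10)) % len(pool)}" for _ in range(3 * k)}
    return [c for c in picks if c not in observed][:k]


def filtered_llm_step(observed: dict, pool: list) -> str:
    """Agentic-style step: the LLM proposes several candidates and a statistical
    filter (here the same toy surrogate) chooses the single one to evaluate."""
    proposals = llm_proposer(observed, pool) or [c for c in pool if c not in observed][:1]
    return statistical_proposer(observed, proposals)


# One round of each strategy from the same warm-start observations.
observed = {c: oracle(c) for c in random.sample(POOL, 5)}
print("statistical pick:", statistical_proposer(observed, POOL))
print("LLM + filter pick:", filtered_llm_step(observed, POOL))
```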
Submission Number: 93