Bayesian Optimization for Protein Sequence Design: Back to Simplicity with Gaussian Processes

Published: 08 Oct 2024, Last Modified: 03 Nov 2024 · AI4Mat-NeurIPS-2024 · CC BY 4.0
Submission Track: Short Paper
Submission Category: AI-Guided Design
Keywords: Bayesian Optimization, Gaussian Process, String kernels, Fingerprint kernels, encoding, protein sequence design
TL;DR: Classic BO with GPs is competitive with PLMs as surrogates for protein sequence design.
Abstract: Bayesian optimization (BO) is a popular sequential decision-making approach for maximizing black-box functions in low-data regimes. In biology, it has been used to find well-performing protein sequence candidates, since gradient information is not available from in vitro experimentation. Recent in silico design methods have leveraged large pre-trained protein language models (PLMs) to predict protein fitness. However, PLMs have a number of shortcomings for sequential design tasks: i) their limited ability to model uncertainty, ii) the lack of closed-form Bayesian updates in light of new experimental data, and iii) the challenge of fine-tuning on small downstream task datasets. We take a step back to traditional BO by investigating Gaussian process (GP) surrogate models with various sequence kernels, which are able to properly model uncertainty and update their beliefs over multi-round design tasks. We empirically evaluate our method on the sequence design benchmark ProteinGym, and demonstrate that BO with GPs is competitive with large SOTA pre-trained PLMs at a fraction of the compute budget.
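To make the workflow the abstract describes concrete, here is a minimal sketch of multi-round BO with a GP surrogate over sequences. Everything here is an assumption for illustration: the tiny alphabet, the one-hot encoding with an RBF kernel (the paper studies dedicated string and fingerprint kernels), the toy fitness function, and the UCB acquisition rule are stand-ins, not the paper's actual setup.

```python
# Illustrative sketch of BO for sequence design with a GP surrogate.
# NOT the paper's method: the encoding, kernel, fitness oracle, and
# acquisition function below are all simplifying assumptions.
import itertools
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

ALPHABET = "ACDE"  # toy amino-acid alphabet for the example
LENGTH = 3

def one_hot(seq):
    """Flatten a sequence into a one-hot feature vector."""
    vec = np.zeros(len(seq) * len(ALPHABET))
    for i, aa in enumerate(seq):
        vec[i * len(ALPHABET) + ALPHABET.index(aa)] = 1.0
    return vec

def toy_fitness(seq):
    """Hypothetical black-box 'wet-lab' oracle: rewards 'A' residues."""
    return seq.count("A") / len(seq)

pool = ["".join(p) for p in itertools.product(ALPHABET, repeat=LENGTH)]
rng = np.random.default_rng(0)

# Initial design: a few labelled sequences (the low-data regime).
observed = list(rng.choice(pool, size=4, replace=False))
X = np.array([one_hot(s) for s in observed])
y = np.array([toy_fitness(s) for s in observed])

for _ in range(5):  # five sequential design rounds
    # Exact closed-form GP posterior update -- no fine-tuning required.
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0),
                                  normalize_y=True).fit(X, y)
    candidates = [s for s in pool if s not in observed]
    mu, sigma = gp.predict(np.array([one_hot(s) for s in candidates]),
                           return_std=True)
    # Upper-confidence-bound acquisition: exploit mean, explore variance.
    best = candidates[int(np.argmax(mu + 1.0 * sigma))]
    observed.append(best)
    X = np.vstack([X, one_hot(best)])
    y = np.append(y, toy_fitness(best))

print(max(observed, key=toy_fitness))  # best sequence found so far
```

Swapping the RBF-on-one-hot surrogate for a string or fingerprint kernel only changes the covariance function; the acquisition loop and the closed-form posterior update stay the same, which is the practical appeal of the GP approach.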
Submission Number: 36