Effective Surrogate models for protein design with Bayesian optimization

Nate Gruver, Samuel Don Stanton, Polina Kirichenko, Marc Anton Finzi, Phillip Maffettone, Vivek Myers, Emily Delaney, Peyton Greenside, Andrew Gordon Wilson

18 Jun 2022OpenReview Archive Direct UploadReaders: Everyone

Abstract: Bayesian optimization, which uses a probabilistic surrogate for an expensive black-box function, provides a framework for protein design that requires a small amount of labeled data. In this paper, we compare three approaches to constructing surrogate models for protein design on synthetic benchmarks. We find that neural network ensembles trained directly on primary sequences outperform string kernel Gaussian processes and models built on pre-trained embeddings. We show that this superior performance is likely due to improved robustness on out-of-distribution data. Transferring these insights into practice, we apply our approach to optimizing the Stokes shift of green fluorescent protein, discovering and synthesizing novel variants with improved functional properties.

0 Replies