Abstract: Bayesian optimization, which uses a probabilistic surrogate for an expensive black-box function, provides a framework for protein design that requires only a small amount of labeled data. In this paper, we compare three approaches to constructing surrogate models for protein design on synthetic benchmarks. We find that neural network ensembles trained directly on primary sequences outperform both string-kernel Gaussian processes and models built on pre-trained embeddings. We present evidence that this advantage is likely due to improved robustness on out-of-distribution data. Transferring these insights into practice, we apply our approach to optimizing the Stokes shift of green fluorescent protein, discovering and synthesizing novel variants with improved functional properties.