Combining Structure and Sequence for Superior Fitness Prediction

Published: 27 Oct 2023, Last Modified: 23 Nov 2023
GenBio@NeurIPS2023 Poster
Keywords: Generative AI, protein language models, inverse folding, protein fitness prediction, protein design
TL;DR: We find that inverse folding models predict mutation effects on stability well, and we leverage this to develop a sequence-structure hybrid model with SOTA performance on mutation effect prediction.
Abstract: Deep generative models of protein sequence and inverse folding models have shown great promise as protein design methods. While sequence-based models have shown strong zero-shot mutation effect prediction performance, inverse folding models have not been extensively characterized in this way. As these models use information from protein structures, it is likely that inverse folding models possess inductive biases that make them better predictors of certain function types. Using the collection of model scores contained in the newly updated ProteinGym, we systematically explore the differential zero-shot predictive power of sequence and inverse folding models. We find that inverse folding models consistently outperform the best-in-class sequence models on assays of protein thermostability, but have lower performance on other properties. Motivated by these findings, we develop StructSeq, an ensemble model combining information from sequence, multiple sequence alignments (MSAs), and structure. StructSeq achieves state-of-the-art Spearman correlation on ProteinGym and is robust to different functional assay types.
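The abstract does not specify how StructSeq combines its component scores, so the following is a minimal, hypothetical sketch of one common ensembling approach: averaging z-scored zero-shot log-likelihood scores from a sequence model, an MSA-based model, and an inverse folding model, then evaluating with Spearman correlation (ProteinGym's primary metric). The function names and the simple-average combination rule are assumptions for illustration, not the paper's actual method.

```python
# Hypothetical sketch of a StructSeq-style ensemble. The paper does not
# state its exact combination rule; here we assume a plain average of
# z-scored per-mutant scores from three model families.
import numpy as np
from scipy.stats import spearmanr

def zscore(x: np.ndarray) -> np.ndarray:
    """Standardize scores so models on different scales are comparable."""
    return (x - x.mean()) / x.std()

def ensemble_scores(seq_scores, msa_scores, struct_scores):
    """Average z-scored zero-shot scores from sequence, MSA, and
    inverse folding (structure) models into one fitness prediction."""
    stacked = np.stack([zscore(s) for s in (seq_scores, msa_scores, struct_scores)])
    return stacked.mean(axis=0)

# Dummy example: 100 mutants with noisy, correlated model scores.
rng = np.random.default_rng(0)
fitness = rng.normal(size=100)                      # measured assay values
seq = fitness + rng.normal(scale=0.8, size=100)     # sequence-model scores
msa = fitness + rng.normal(scale=0.8, size=100)     # MSA-model scores
struct = fitness + rng.normal(scale=0.8, size=100)  # inverse-folding scores

combined = ensemble_scores(seq, msa, struct)
rho, _ = spearmanr(combined, fitness)               # rank correlation with assay
print(f"Spearman rho: {rho:.3f}")
```

Because each component model captures different inductive biases (evolutionary statistics vs. structural constraints), the averaged ensemble tends to be more robust across assay types than any single model, which is consistent with the robustness claim above.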
Submission Number: 65