Disentanglement of Evolutionary Constraints in Statistical Models of Proteins

Haobo Wang, Shihao Feng, Kotaro Tsuboyama, Sirui Liu, Gabriel J. Rocklin, Sergey Ovchinnikov

Published: 17 Apr 2024, Last Modified: 27 Sept 2024, OpenReview Archive Direct Upload, Readers: Everyone, License: CC BY 4.0

Abstract: The exponential growth of protein sequences in the post-genomic era has revolutionized the application of generative sequence models for pivotal tasks such as contact prediction, protein design, alignment, and homology search. Despite remarkable progress in these areas, the interpretability of the modeled pairwise parameters remains limited due to complexities arising from coevolution, phylogeny, and entropy. While post-correction methods for contact prediction have been developed to eliminate entropy-related contributions from predicted contact maps, there is currently no direct approach to correct entropy in other applications that rely on the raw parameters. In this paper, we investigate the sources of the entropy signal and propose a novel spectral regularizer, LH (an abbreviation of Henri Lebesgue), to mitigate its impact during model fitting. By incorporating this regularizer into the GREMLIN framework (which uses a Markov random field, or Potts model), we enable the accurate inference of sparse contact maps while simultaneously improving interpretability and addressing overfitting concerns critical for sequence evaluation and design. To validate the efficacy of our approach, we design multiple protein sequences based on GREMLIN with both L2 and LH regularizers, and subsequently measure their folding stability experimentally using cDNA display proteolysis. Our findings demonstrate that proteins designed using the LH regularizer exhibit increased diversity and enhanced folding stability.