How well do generative protein models generate?

Published: 04 Mar 2024, Last Modified: 29 Apr 2024
Venue: GEM Poster
License: CC BY 4.0
Track: Machine learning: computational method and/or computational results
Keywords: Protein Design, Generative Models, Evaluation, Benchmarks
TL;DR: How do you know if your sequences are any good!? We propose a comprehensive set of evaluation metrics for generated protein sequences.
Abstract: Protein design relies critically on the generation of plausible sequences. Yet the efficacy for sequence sampling of many common model architectures, from simple, interpretable models such as position-specific scoring matrices (PSSMs) and direct coupling analysis (DCA) to newer, less interpretable models such as variational autoencoders (VAEs), autoregressive large language models (AR-LLMs), and flow matching (FM), remains uncertain. While some models offer unique sequence generation methods, issues such as mode collapse, generation of nonsensical repeats, and protein truncations persist. Trusted methods such as Gibbs sampling are often preferred for their reliability but can be computationally expensive. This paper addresses the need to evaluate the performance and limitations of the generation methods available for different protein models, considering dependencies on multiple sequence alignment (MSA) depth and the available sequence diversity. We propose rigorous evaluation methods and metrics for assessing sequence generation, aiming to guide design decisions and inform the development of future models and sampling techniques for protein design applications.
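To make the computational-cost point concrete, here is a minimal sketch (not the authors' code) of Gibbs sampling from a toy DCA-style Potts model with hypothetical random fields `h` and couplings `J`: each sweep resamples every position from its conditional distribution given the rest of the sequence, costing O(L^2 q) per sweep, which is why long chains over deep MSAs become expensive.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy model: L positions, q amino-acid states.
L, q = 10, 20
h = rng.normal(scale=0.1, size=(L, q))           # per-position fields
J = rng.normal(scale=0.05, size=(L, L, q, q))    # pairwise couplings
J = (J + J.transpose(1, 0, 3, 2)) / 2            # symmetrize J_ij = J_ji^T
for i in range(L):
    J[i, i] = 0.0                                # no self-couplings

def gibbs_sample(n_sweeps=50):
    """One Gibbs chain: resample each position from its conditional."""
    seq = rng.integers(q, size=L)
    for _ in range(n_sweeps):
        for i in range(L):
            # Conditional logits at position i given all other positions:
            # h_i(a) + sum_{j != i} J_ij(a, seq_j)
            logits = h[i] + sum(J[i, j, :, seq[j]] for j in range(L) if j != i)
            p = np.exp(logits - logits.max())
            p /= p.sum()
            seq[i] = rng.choice(q, p=p)
    return seq

sample = gibbs_sample()
print(sample.shape)  # (10,)
```

Each sweep touches every (position, state, neighbor) triple, so the per-sample cost grows quadratically with sequence length, in contrast to a single forward pass for an amortized generator such as a VAE.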
Submission Number: 109