FLOP: Tasks for Fitness Landscapes Of Protein wildtypes

01 Jun 2023 (modified: 12 Dec 2023) | Submitted to NeurIPS 2023 Datasets and Benchmarks
Keywords: protein engineering, enzyme engineering, representation learning, benchmark, optimization, wildtype
TL;DR: Novel tasks to benchmark protein representations on downstream supervised learning for single-family wildtype proteins
Abstract: Protein engineering has the potential to create optimized protein variants with improved properties and function. An initial step in the protein optimization process typically consists of a search among natural (wildtype) sequences to find the naturally occurring proteins with the most desirable properties. Promising candidates from this initial discovery phase then form the basis of the second step: a more local optimization procedure, exploring the space of variants separated from this candidate by a number of mutations. While considerable progress has been made on evaluating machine learning methods on single protein datasets, benchmarks of data-driven approaches for global fitness landscape exploration are still lacking. In this paper, we have carefully curated a representative benchmark dataset, which reflects industrially relevant scenarios for the initial wildtype discovery phase of protein engineering. We focus on exploration within a protein family, and investigate the downstream predictive power of various protein representation paradigms, i.e., protein language model-based representations, structure-based representations, and evolution-based representations. Our benchmark highlights the importance of coherent split strategies, and how we can be misled into overly optimistic estimates of the state of the field. The codebase and data can be accessed via https://github.com/petergroth/FLOP.
Supplementary Material: pdf
Submission Number: 847