FLOP: Tasks for Fitness Landscapes Of Protein families using sequence- and structure-based representations

22 Sept 2022 (modified: 13 Feb 2023)
ICLR 2023 Conference Withdrawn Submission
Readers: Everyone
Keywords: protein engineering, representation learning, generalization, benchmark, enzyme engineering, protein structure, protein language model
Abstract: Protein engineering has the potential to create optimized protein variants with improved properties and function. An initial step in the protein optimization process typically consists of a search among natural (wildtype) sequences to find the naturally occurring protein with the most desirable properties. This chosen candidate is then the basis for the second step: a more local optimization procedure that explores the space of variants separated from this candidate by a few mutations. While advances in protein representation learning promise to facilitate the exploration of wildtype space, results from real-life cases are often underwhelming, and progress in the area is difficult to track. In this paper, we present a carefully curated, representative benchmark dataset that reflects industrially relevant scenarios for the initial wildtype exploration step of protein engineering. We focus on exploration within a protein family or superfamily, and investigate the downstream predictive power of the dominant protein representation paradigms, i.e., transformer-based language representations, structure-based representations, and evolution-based representations. Our benchmark highlights the importance of coherent split strategies and shows how we can be misled into overly optimistic estimates of the state of the field. We hope our benchmark can drive further methodological developments in this important field.
Please Choose The Closest Area That Your Submission Falls Into: Infrastructure (eg, datasets, competitions, implementations, libraries)
TL;DR: A novel benchmark dataset for exploring single-family protein fitness landscapes in protein engineering
Supplementary Material: zip