PROTEIN FITNESS LANDSCAPE NAVIGATION IS BOOSTED VIA INCORPORATING EVOLUTIONARY INFORMATION INTO MACHINE LEARNING MODELS
Track: Machine learning: computational method and/or computational results
Keywords: Protein Fitness Landscape, Machine Learning, Natural Language Processing, Ancestral Sequence Reconstruction, Uncertainty Robustness
Abstract: In protein engineering, machine learning (ML) models are limited by the quality and diversity of available training data. This paper presents a data-centric methodology that integrates evolutionary information from ancestral sequence reconstruction (ASR) into ML models. ASR produces datasets that capture two properties central to protein engineering, namely the ability of proteins to fold correctly and to maintain thermal stability. The methodology also exploits the robustness of ASR to uncertainty in its maximum likelihood estimates to greatly expand the available data pool. We expect these datasets to markedly improve outcomes in protein engineering tasks. In this project, we demonstrate the value of evolutionary datasets curated through ASR as (i) a rich source of training data for generative models and (ii) a basis for deriving family-specific protein representations, which improved ML prediction performance across multiple protein families. This work underscores the potential of evolutionary data to improve ML model efficacy in protein engineering by providing datasets that are both large and of high functional quality.
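The abstract does not say how uncertainty in the ASR estimates is turned into additional sequences. One plausible reading, sketched below purely as an illustration, is that each ancestral node comes with per-site posterior probabilities over residues, and that alternative sequences are drawn from those distributions alongside the single most likely reconstruction. The function names, the input format (one dictionary of residue probabilities per site), and the toy probabilities are all assumptions made for this sketch and are not taken from the paper.

```python
import random


def most_likely_sequence(site_posteriors):
    """Return the sequence built from the highest-posterior residue at each site."""
    return "".join(max(site, key=site.get) for site in site_posteriors)


def sample_ancestral_variants(site_posteriors, n_samples, seed=0):
    """Draw sequences site by site from per-site posterior distributions.

    site_posteriors: list of dicts mapping residue -> posterior probability
                     (hypothetical input format, one dict per alignment site).
    Returns the sorted set of distinct sampled sequences.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    variants = set()
    for _ in range(n_samples):
        residues = []
        for site in site_posteriors:
            letters = list(site.keys())
            weights = list(site.values())
            # Sites are sampled independently, which is a simplifying assumption.
            residues.append(rng.choices(letters, weights=weights, k=1)[0])
        variants.add("".join(residues))
    return sorted(variants)


# Toy posteriors for a three-site ancestor (made-up numbers for illustration).
toy_posteriors = [
    {"M": 1.0},
    {"K": 0.6, "R": 0.4},
    {"L": 0.9, "I": 0.1},
]

best = most_likely_sequence(toy_posteriors)
variants = sample_ancestral_variants(toy_posteriors, n_samples=50)
```

Under this reading, a single reconstructed node contributes many plausible sequences rather than one, which is one way the data pool could grow. Whether the authors sample independently per site, threshold on posterior probability, or use another scheme is not stated in the abstract.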
Submission Number: 113