Keywords: Positive-Unlabeled Learning, Protein Design, One-Class Classification
Abstract: We consider prediction of protein function, focusing on protein functionalities that enhance survival for one or more organisms. Sequencing these organisms yields plentiful positive training examples due to survivorship bias. In contrast, synthesizing and characterizing a protein with a mutation unseen in nature requires time-consuming wet-lab experiments, making negative training examples scarce. Datasets are therefore imbalanced, hindering classifier accuracy outside the training data. Positive-unlabeled (PU) learning addresses this issue by treating unlabeled protein sequences as part of the data and modeling each as positive with a probability called the class prior. This class prior is typically taken to be constant. Our insight is that an understanding of evolution suggests a novel sequence-dependent class prior when learning from sequencing data. We propose Evo-PU, a PU learning framework that integrates this class prior into a likelihood for training classifiers. We evaluate Evo-PU on multiple real-world tasks on the influenza hemagglutinin protein. Using influenza genomic surveillance data and held-out laboratory assays of mutants unseen in nature, Evo-PU outperforms state-of-the-art PU learning, one-class classification (OCC), and deep generative model (DGM)-based methods, demonstrating the benefit of combining evolutionary modeling with data-driven learning for protein design. We further assess Evo-PU on standard ProteinGym benchmarks, focusing on overall protein fitness prediction, where it outperforms existing PU learning and OCC baselines while remaining competitive with DGM-based approaches.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Submission Number: 22765