Benchmarking probabilistic machine learning in protein fitness landscape predictions

Published: 17 Jun 2024, Last Modified: 16 Jul 2024, Venue: ML4LMS Poster, License: CC BY 4.0
Keywords: protein fitness prediction, probabilistic modeling, uncertainty quantification
Abstract: Machine learning-guided protein engineering, which combines high-throughput screening and deep sequencing of protein mutagenesis libraries with machine learning, is a powerful approach for engineering proteins and interrogating their fitness landscapes. Uncertainty quantification enhances the trustworthiness of model predictions by indicating their reliability, and can therefore guide downstream experimental work. Aleatoric uncertainty captures the inherent observational noise in protein properties, whereas epistemic uncertainty reveals gaps in the model’s knowledge that depend on the amount of training data. Although uncertainty quantification has been investigated in protein engineering applications, systematic benchmarks for selecting probabilistic machine learning models, and for assessing the benefits of different types of uncertainty in protein fitness prediction, are lacking. To address this gap, our study benchmarks six advanced probabilistic modeling techniques across eleven diverse protein-fitness datasets, using metrics of prediction accuracy and uncertainty quality to assess performance in both in-distribution and out-of-distribution scenarios. Our findings offer insights into the application of uncertainty-aware machine learning in high-throughput protein screening experiments. Our study supports more robust and efficient experimental processes and enhances the practical usability of machine learning models in real-world protein-fitness-related tasks such as therapeutic antibody optimization and viral evolution.
Poster: pdf
Submission Number: 75
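
The distinction the abstract draws between aleatoric and epistemic uncertainty can be illustrated with a minimal sketch. It assumes an ensemble of regressors that each output a predictive mean and variance per protein variant, and applies the standard law-of-total-variance decomposition. This is an illustration only, not necessarily one of the six techniques benchmarked in the study; the function name `decompose_uncertainty` and the toy numbers are hypothetical.

```python
import numpy as np

def decompose_uncertainty(means, variances):
    """Split ensemble predictive variance into aleatoric and epistemic parts.

    means, variances: array-likes of shape (n_models, n_variants), holding
    each ensemble member's predicted mean and predicted noise variance.
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    # Aleatoric: average of the noise variances the members predict.
    aleatoric = variances.mean(axis=0)
    # Epistemic: disagreement between members (population variance of means).
    epistemic = means.var(axis=0)
    return aleatoric, epistemic, aleatoric + epistemic

# Toy values (made up): 3 ensemble members, 2 variants.
# Members agree on variant 0 and disagree on variant 1.
means = [[1.0, 0.0], [1.0, 2.0], [1.0, 4.0]]
variances = [[0.5, 0.1], [0.5, 0.1], [0.5, 0.1]]
aleatoric, epistemic, total = decompose_uncertainty(means, variances)
print(aleatoric, epistemic, total)
```

In this toy case the first variant has high aleatoric but zero epistemic uncertainty, while the second has low aleatoric but high epistemic uncertainty, the pattern one might expect for a variant far from the training data.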