A New Ultra-High-Throughput Assay for Measuring Protein Fitness

Published: 04 Mar 2024, Last Modified: 29 Apr 2024 | GEM Oral | CC BY 4.0
Track: Biology: datasets and/or experimental results
Cell: I do not want my work to be considered for Cell Systems
Keywords: protein design, high-throughput experiments, Bayesian inference, variational inference, protein language model
TL;DR: We develop a new ultra-high-throughput assay for measuring protein fitness and a Bayesian method for denoising its results, and we benchmark models on a novel dataset generated with them.
Abstract: Machine learning (ML) for protein design frequently requires large datasets of protein fitness measurements generated by high-throughput experiments; however, publicly available protein fitness datasets generated by deep mutational scanning are noisy and include only $10^3$ to $10^5$ data points. In this work, we present DHARMA, a new protein fitness assay that uses molecular recording via base editors and high-throughput sequencing to measure the fitness of up to $10^6$ variants. To mitigate noise in DHARMA experiments, we design FLIGHTED, a Bayesian inference method that denoises the output of a DHARMA experiment for downstream ML applications. Our results show that DHARMA and FLIGHTED can accurately measure protein fitness with calibrated errors. Using this technology, we generate a new fitness dataset of $160{,}000$ TEV protease variants and benchmark a number of standard ML models, including models built on protein language model embeddings, on this dataset. We find that dataset size is the single most important factor in determining ML model performance and that scaling up protein language models does not currently improve performance. DHARMA and FLIGHTED can help generate additional large protein fitness datasets for the ML community.
Submission Number: 63