A New Ultra-High-Throughput Assay for Measuring Protein Fitness

Vikram Sundar; Boqiang Tu; Lindsey Guan; Kevin M. Esvelt

Published: 04 Mar 2024, Last Modified: 29 Apr 2024. Venue: GEM Oral. License: CC BY 4.0

Track: Biology: datasets and/or experimental results

Cell: I do not want my work to be considered for Cell Systems

Keywords: protein design, high-throughput experiments, Bayesian inference, variational inference, protein language model

TL;DR: We develop a new ultra-high-throughput assay for measuring protein fitness and a Bayesian method for denoising its results, and we benchmark models on a novel dataset.

Abstract: Machine learning (ML) for protein design frequently requires large datasets of protein fitness measurements generated by high-throughput experiments; however, publicly available protein fitness datasets generated by deep mutational scanning are noisy and include only $10^3$ to $10^5$ data points. In this work, we present DHARMA, a new protein fitness assay that uses molecular recording via base editors and high-throughput sequencing to measure the fitness of up to $10^6$ variants. To mitigate noise in DHARMA experiments, we design a Bayesian inference method, FLIGHTED, that denoises the output of a DHARMA experiment for downstream ML applications. Our results show that DHARMA and FLIGHTED can accurately measure protein fitness with calibrated errors. Using this technology, we generate a new fitness dataset of $160000$ TEV protease variants and benchmark a number of standard ML models, including models built on protein language model embeddings, on this dataset. We find that data size is the single most important factor in determining ML model performance and that scaling up protein language models does not currently improve performance. DHARMA and FLIGHTED can help generate more large protein fitness datasets for the ML community.

Submission Number: 63
