Green fluorescent protein engineering with a biophysics-based protein language model

Published: 04 Mar 2024, Last Modified: 29 Apr 2024, GEM Poster, CC BY 4.0
Track: Biology: datasets and/or experimental results
Cell: I do not want my work to be considered for Cell Systems
Keywords: Protein engineering, biophysics, protein language model, molecular simulations, finetuning, simulated annealing
Abstract: Deep neural networks and language models are revolutionizing protein modeling and design, but these models struggle in low-data settings and when generalizing beyond their training data. Although prior neural networks have proven capable of learning complex evolutionary or sequence-structure-function relationships from large datasets, they largely ignore the vast accumulated knowledge of protein biophysics, limiting their ability to perform the strong generalization required for protein engineering. We introduce Mutational Effect Transfer Learning (METL), a specialized protein language model for predicting quantitative protein function that bridges the gap between traditional biophysics-based and machine learning approaches. METL incorporates synthetic data from molecular simulations as a means to augment experimental data with biophysical information. Molecular modeling can generate large datasets revealing mappings from amino acid sequences to protein structure and properties. Pretraining protein language models on these data can impart fundamental biophysical knowledge that can be connected with experimental observations. To demonstrate METL's ability to guide protein engineering with limited training data, we applied it to design green fluorescent protein sequence variants in complex scenarios. Of the 20 designed sequences, 16 exhibited fluorescence, and 6 exhibited greater fluorescence than the wild type.
Submission Number: 39
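
Illustrative sketch: the keywords list simulated annealing, and the abstract describes using a model's predictions to guide sequence design. The abstract does not specify the actual procedure, so the following is only a minimal, generic simulated-annealing loop over single-residue substitutions, not the authors' implementation. The scoring function `toy_score`, the target string, the cooling schedule, and all parameter values are hypothetical stand-ins; in practice the score would come from a trained predictor of protein function.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def toy_score(seq, target):
    """Hypothetical stand-in for a model's predicted function score:
    the fraction of positions at which `seq` matches `target`."""
    return sum(a == b for a, b in zip(seq, target)) / len(target)


def simulated_annealing(start, score_fn, n_steps=2000,
                        t_start=1.0, t_end=0.01, seed=0):
    """Maximize score_fn over sequences using single-residue substitutions."""
    rng = random.Random(seed)
    current, current_score = start, score_fn(start)
    best, best_score = current, current_score
    for step in range(n_steps):
        # Geometric cooling from t_start down to t_end (an assumed schedule).
        frac = step / max(n_steps - 1, 1)
        temp = t_start * (t_end / t_start) ** frac
        # Propose a random substitution at a random position.
        pos = rng.randrange(len(current))
        candidate = current[:pos] + rng.choice(AMINO_ACIDS) + current[pos + 1:]
        cand_score = score_fn(candidate)
        delta = cand_score - current_score
        # Metropolis criterion: always accept improvements, and accept
        # worse candidates with probability exp(delta / temp).
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, current_score = candidate, cand_score
            if current_score > best_score:
                best, best_score = current, current_score
    return best, best_score


# Arbitrary toy strings, not real protein sequences.
target = "ACDEFGHIKL"
start = "GGGGGGGGGG"
best, best_score = simulated_annealing(start, lambda s: toy_score(s, target))
print(best, best_score)
```

Tracking the best sequence seen separately from the current one means the returned design is never worse than the starting sequence under the scoring function, even though the search itself accepts some downhill moves.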