Accurate predictions of enzymatic biochemistry as an enabler for generation of de-novo sequences

Published: 04 Mar 2024, Last Modified: 29 Apr 2024, GEM Poster, CC BY 4.0
Track: Biology: datasets and/or experimental results
Keywords: biological dataset, enzyme, biochemistry, machine learning, oracle, functional annotation
TL;DR: We publish the most comprehensive dataset of a practically crucial enzyme class, terpene synthases, and present ML oracles for reliable in-silico annotation of the corresponding enzymatic activity.
Abstract: Terpene synthases (TPSs) generate the scaffolds of the largest class of natural products, including several first-line medicines. The number of available TPS protein sequences is increasing exponentially, but computational characterization of their function remains an unsolved challenge. We assembled a curated dataset of one thousand characterized TPS reactions and developed a method for building highly accurate machine-learning models for functional annotation in a low-data regime. Our models significantly outperform existing methods for TPS detection and substrate prediction. By applying the models to large protein sequence databases, we discovered seven TPS enzymes previously undetected by state-of-the-art computational tools and experimentally confirmed their activity. We also discovered a new TPS structural domain and distinct subtypes of previously known domains. Our work demonstrates the potential of machine learning to speed up the discovery and characterization of novel TPSs. In addition, the in-silico functional annotations provide the ML community with a large dataset of pseudo-labeled exemplary TPS sequences. The accurate models for TPS detection and substrate prediction can serve as oracles to check for the presence of the desired biochemical activity in generated sequences. We envision that the published dataset of exemplary TPS sequences and the accurate TPS-annotation models will boost the generation of de-novo enzymatic TPS sequences.
Submission Number: 80