Visible to everyone since 04 Oct 2024. License: CC BY 4.0.
The ability to generate novel enzymes that catalyze reactions of specific target molecules is a critical advance for biomaterial synthesis and chemical production. However, a significant challenge arises when no enzymes have been recorded for the target molecule, which makes this a zero-shot generation problem. Without known enzymes, it is difficult to train a generative model tailored to the target substrate. To address this, we propose a retrieval-augmented generation method that leverages existing enzyme-substrate data to compensate for the lack of direct examples. Because no catalytic performance has been recorded between the existing enzymes and the new target molecule, the first challenge is to identify enzymes that are helpful for generation. Our approach retrieves enzymes whose substrates are structurally similar to the target molecule, thereby exploiting the functional similarity reflected in the enzymes' catalytic capability. The second challenge is how to use the retrieved enzymes to generate a novel enzyme capable of catalyzing the target molecule, given that none of the retrieved enzymes catalyzes it directly. To solve this, we employ a conditioned discrete diffusion model that takes the aligned retrieved enzymes as input and generates a new enzyme. We train the generator with guidance from an enzyme-substrate relationship classifier so that it outputs the optimal protein sequence distribution for each target molecule. We evaluate our model on enzyme design tasks involving a diverse set of real-world substrates. Our results, including catalytic rate predictions, foldability assessments, and docking position analyses, demonstrate that our model outperforms existing protein generation methods for substrate-specified enzyme generation. Additionally, we formally define the zero-shot substrate-specified enzyme generation task and contribute a comprehensive dataset with evaluation methods.
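The retrieval step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes substrates are given as SMILES strings, uses character n-grams as a toy stand-in for a real chemical fingerprint, and ranks database enzymes by the Tanimoto similarity between their substrate and the target molecule. The function names, the database entries, and the enzyme identifiers are all hypothetical.

```python
from typing import List, Set, Tuple


def ngram_fingerprint(smiles: str, n: int = 2) -> Set[str]:
    """Toy fingerprint (an assumption): the set of character n-grams of a SMILES string.

    A real system would use a chemistry-aware fingerprint instead.
    """
    if len(smiles) < n:
        return {smiles}
    return {smiles[i:i + n] for i in range(len(smiles) - n + 1)}


def tanimoto(a: Set[str], b: Set[str]) -> float:
    """Tanimoto (Jaccard) similarity between two fingerprint sets."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def retrieve_enzymes(
    target_smiles: str,
    database: List[Tuple[str, str]],
    k: int = 2,
) -> List[Tuple[float, str]]:
    """Return the k enzymes whose recorded substrates are most similar to the target.

    `database` holds (substrate SMILES, enzyme identifier) pairs.
    """
    target_fp = ngram_fingerprint(target_smiles)
    scored = [
        (tanimoto(target_fp, ngram_fingerprint(substrate)), enzyme)
        for substrate, enzyme in database
    ]
    # Sort by similarity only, highest first; ties keep database order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical enzyme-substrate records; the target has no recorded enzyme.
DATABASE = [
    ("CCO", "enzyme_A"),
    ("c1ccccc1", "enzyme_C"),
    ("CCCO", "enzyme_B"),
]

if __name__ == "__main__":
    for score, enzyme in retrieve_enzymes("CCCCO", DATABASE, k=2):
        print(f"{enzyme}: similarity {score:.2f}")
```

In the setting described above, the retrieved enzymes would then be aligned and passed as conditioning input to the generator; that step is outside the scope of this sketch.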