Abstract: Generating novel enzymes for target molecules in zero-shot scenarios is a fundamental challenge in biomaterial synthesis and chemical production. When no enzyme is known for a target molecule, training a generative model is difficult because there is no direct supervision. To address this, we propose a retrieval-augmented generation method that uses existing enzyme-substrate data to guide enzyme design. Our method retrieves enzymes whose substrates are structurally similar to the target molecule, on the premise that structurally similar substrates call for functionally similar catalytic activity. Since none of the retrieved enzymes is known to catalyze reactions of the target molecule itself, we use a conditioned discrete diffusion model to generate new enzymes based on the retrieved examples. An enzyme-substrate relationship classifier guides the generation process, steering the distribution of generated protein sequences toward the target substrate. We evaluate our model on enzyme design tasks with diverse real-world substrates and show that it outperforms existing protein generation methods in catalytic capability, foldability, and docking accuracy. Additionally, we define the zero-shot substrate-specified enzyme generation task and introduce a dataset with evaluation benchmarks.
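The retrieval step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how structural similarity is measured, so this example assumes Tanimoto similarity over substrate fingerprints represented as sets of feature indices. The names `tanimoto` and `retrieve_enzymes`, the toy fingerprints, and the placeholder sequences are all hypothetical and are not taken from the paper.

```python
from typing import FrozenSet, List, Tuple

# A substrate fingerprint is modelled as a set of "on" feature indices.
# This representation is an assumption made for illustration only.
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto (Jaccard) similarity between two fingerprints."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)


def retrieve_enzymes(
    target_fp: Fingerprint,
    database: List[Tuple[str, Fingerprint]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the k enzymes whose known substrates are most similar to the target.

    `database` pairs an enzyme sequence with the fingerprint of a substrate
    it is known to act on.
    """
    scored = [(seq, tanimoto(target_fp, fp)) for seq, fp in database]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy data: placeholder sequences and fingerprints, not real enzymes.
database = [
    ("MKTAYIAKQR", frozenset({1, 2, 3, 4})),
    ("MSLLTEVETY", frozenset({3, 4, 5, 6})),
    ("MGSSHHHHHH", frozenset({7, 8, 9})),
]
target = frozenset({1, 2, 3, 5})

top = retrieve_enzymes(target, database, k=2)
for seq, score in top:
    print(f"{seq}\t{score:.3f}")
```

In the method the abstract describes, retrieved examples like these would serve as conditioning input to the diffusion model rather than as final answers, since none of them is known to act on the target.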
Lay Summary: Designing enzymes for new molecules is essential for advancing green chemistry, medicine, and sustainable materials. But creating enzymes from scratch is extremely difficult, especially when none are known to work on a target molecule. Most AI models need examples of success to learn from, and in this case, there are none. Our approach tackles this by retrieving enzymes that work on molecules similar to the target and using them as inspiration for design. We then use a generative AI model to produce enzymes tailored to the target, guided by a system that checks whether each enzyme is likely to work. This ensures the enzymes we create aren’t just random sequences: they’re functional and realistic. We tested our method on many real-world molecules, and it consistently outperformed existing techniques in creating useful, stable enzymes. Our work opens the door to designing enzymes for entirely new molecules without needing prior examples, and could help speed up innovation in drug discovery and biomanufacturing. We also introduce new benchmarks to measure progress on this challenge going forward.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Primary Area: Applications->Chemistry, Physics, and Earth Sciences
Keywords: Enzyme design, Diffusion model
Submission Number: 5934