Structure-based synthetic data augmentation for protein language models

Published: 06 Mar 2025, Last Modified: 26 Apr 2025 · GEM · CC BY 4.0
Track: Machine learning: computational method and/or computational results
Nature Biotechnology: Yes
Keywords: protein design, generative model, de novo design, protein language model, diffusion model, synthetic data, distribution shift
TL;DR: We present an experimentally validated pipeline that uses structure-based synthetic data augmentation to shift protein language model generations towards desirable traits, improving novelty and expression rate.
Abstract: The goal of $\textit{de novo}$ protein design is to leverage knowledge of natural proteins to design new ones. Deep generative models of protein structure and sequence are the two dominant $\textit{de novo}$ design paradigms. Structure-based models can produce highly novel proteins, but are constrained by their training data to a narrow range of topologies. Sequence-based design models produce more natural samples over a wider range of topologies, but with reduced novelty. Here, we propose a structure-based synthetic data augmentation approach that combines the benefits of structure- and sequence-based generative models of proteins. We generated and characterized 240,830 $\textit{de novo}$ backbone structures and used these backbones to generate 45 million sequences for data augmentation. Models trained with structure-based synthetic data augmentation generate a shifted distribution of proteins that are more likely to express successfully in $\textit{E. coli}$ and are more thermostable. We release the trained models as well as our complete synthetic dataset, BackboneRef.
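The abstract describes a three-stage pipeline: sample de novo backbones with a structure-based generative model, design sequences onto those backbones with an inverse-folding model, and train a protein language model on the mixture of natural and synthetic sequences. Below is a minimal, hypothetical Python sketch of that loop; `sample_backbone`, `inverse_fold`, and the toy dataset are placeholder stand-ins of our own, not the authors' models, code, or API.

```python
# A minimal, hypothetical sketch of the augmentation loop described above.
# sample_backbone and inverse_fold are stubs standing in for a structure-based
# generative model (e.g. a diffusion model) and an inverse-folding model;
# they are NOT the authors' actual implementations.
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def sample_backbone(length: int) -> list[tuple[float, float, float]]:
    """Stand-in for a structure generative model: returns dummy CA coordinates."""
    return [(random.gauss(0, 1), random.gauss(0, 1), random.gauss(0, 1))
            for _ in range(length)]

def inverse_fold(backbone, n_seqs: int) -> list[str]:
    """Stand-in for an inverse-folding model that designs sequences for a backbone."""
    return ["".join(random.choice(AMINO_ACIDS) for _ in backbone)
            for _ in range(n_seqs)]

def build_augmented_dataset(natural_seqs, n_backbones, seqs_per_backbone):
    """Mix natural sequences with backbone-derived synthetic sequences
    (the paper scales this to 240,830 backbones and 45 million sequences)."""
    synthetic = []
    for _ in range(n_backbones):
        backbone = sample_backbone(length=random.randint(60, 120))
        synthetic.extend(inverse_fold(backbone, seqs_per_backbone))
    return natural_seqs + synthetic

if __name__ == "__main__":
    natural = ["MKTAYIAKQR", "MLSDEDFKAV"]  # toy stand-ins for natural sequences
    data = build_augmented_dataset(natural, n_backbones=3, seqs_per_backbone=5)
    print(f"{len(data)} training sequences ({len(data) - len(natural)} synthetic)")
```

In the paper's setting, the augmented dataset would then be used to train or fine-tune a protein language model, shifting its generative distribution towards the structure-derived synthetic sequences.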
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Presenter: ~Ava_P_Amini1
Format: No, the presenting author is unable to, or unlikely to be able to, attend in person.
Funding: No, the presenting author of this submission does *not* fall under ICLR’s funding aims, or has sufficient alternate funding.
Submission Number: 62