Keywords: proteins, ml for protein engineering, generative models, latent diffusion
TL;DR: PLAID generates all-atom protein structure by diffusing in the latent space of a sequence-to-structure model; since it only requires sequence inputs, we expand the usable data distribution by 2 to 4 orders of magnitude.
Abstract: Generative models for protein design are gaining interest for their potential scientific impact. However, biological processes are mediated by many modalities, and simultaneously generating multiple biological modalities remains a challenge. We propose **PLAID (Protein Latent Induced Diffusion)**, whereby multimodal biological generation is achieved by learning and sampling from the *latent space of a predictor* that maps a more abundant data modality (e.g., sequence) to a less abundant data modality (e.g., crystallized structure). Specifically, we examine the *all-atom* structure generation setting, which requires producing both the 3D structure and 1D sequence to specify how to place sidechain atoms that are critical to function. Crucially, since PLAID **only requires sequence inputs to obtain the latent representation during training**, it allows us to use sequence databases when training the generative model, thus augmenting the sampleable data distribution by $10^2\times$ to $10^4\times$ compared to experimental structure databases. Sequence-only training further unlocks additional annotations that can be used to condition model generation. As a demonstration, we use two conditioning variables: 2219 function keywords from Gene Ontology, and 3617 organisms across the tree of life. Despite not receiving structure inputs during training, model generations nonetheless exhibit strong performance on structure quality, diversity, novelty, and cross-modal consistency metrics. Analysis of function-conditioned samples shows that generated structures preserve non-adjacent catalytic residues at active sites and capture the hydrophobicity pattern of transmembrane proteins, while exhibiting overall sequence diversity. Model weights and code are publicly accessible at `[redacted]`.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 13034