Diffusion on language model encodings for protein sequence generation

28 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: diffusion, protein language models, protein generation
TL;DR: Diffusion on language model encodings for protein sequence generation
Abstract: Protein design requires a deep understanding of the intricate nature of the protein universe. While many efforts focus on conditional generation or specific protein families, the foundational task of unconditional generation remains underexplored and underappreciated. Existing models still struggle to achieve both high quality and diversity in generated protein sequences. To address this gap, we introduce DiMA, a model that performs latent diffusion on representations derived from the protein language model ESM-2 to generate amino acid sequences. We quantitatively investigate the impact of each component of the latent diffusion model, revealing its contribution to protein generation performance. Extensive evaluations using multiple metrics across two protein modalities show that DiMA achieves superior quality, diversity, and distribution matching compared to leading autoregressive transformer-based and discrete diffusion models, while using ten times fewer parameters. Our approach consistently produces novel, diverse protein sequences that accurately reflect the inherent structural and functional diversity of the protein space. Furthermore, we demonstrate the conditional generation capabilities of our method. Our work advances the field of protein design by providing a robust framework for scalable and high-quality protein sequence generation.
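To make the abstract's core idea concrete, the following is a minimal illustrative sketch (not the paper's implementation) of DDPM-style latent diffusion over continuous vectors standing in for ESM-2 sequence encodings. The noise schedule, latent dimension, and the `denoiser` placeholder are all assumptions for illustration; in the actual model the denoiser would be a trained network operating on protein language model latents.

```python
import numpy as np

# Hypothetical toy setup: forward noising q(z_t | z_0) and ancestral
# sampling over continuous latents. All hyperparameters are illustrative.
T = 100                                 # number of diffusion steps (assumed)
betas = np.linspace(1e-4, 0.02, T)      # linear noise schedule (assumed)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def forward_noise(z0, t, rng):
    """Sample z_t ~ q(z_t | z_0) by mixing the clean latent with Gaussian noise."""
    eps = rng.standard_normal(z0.shape)
    zt = np.sqrt(alpha_bars[t]) * z0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return zt, eps

def denoiser(zt, t):
    # Placeholder for the trained noise-prediction network eps_theta(z_t, t).
    # A real model would be learned; here we return zeros so the code runs.
    return np.zeros_like(zt)

def sample(shape, rng):
    """Ancestral sampling: start from z_T ~ N(0, I) and denoise step by step."""
    z = rng.standard_normal(shape)
    for t in reversed(range(T)):
        eps_hat = denoiser(z, t)
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        z = (z - coef * eps_hat) / np.sqrt(alphas[t])
        if t > 0:                        # no noise added at the final step
            z += np.sqrt(betas[t]) * rng.standard_normal(shape)
    return z

rng = np.random.default_rng(0)
z0 = rng.standard_normal((8, 320))       # 8 toy "sequence" latents, dim 320
zt, eps = forward_noise(z0, T - 1, rng)  # fully noised latents
z_gen = sample((8, 320), rng)            # generated latents
```

In the full pipeline, generated latents would be decoded back to amino acid sequences; this sketch only shows the diffusion process itself.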
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 13944