Readers: everyone, since 04 Oct 2024. License: CC BY 4.0
Protein design requires a deep understanding of the protein universe. While many efforts focus on conditional generation or on specific protein families, the foundational task of unconditional generation remains underexplored. Existing models still struggle to achieve both high quality and high diversity in generated protein sequences. To address this gap, we introduce DiMA, a model that applies latent diffusion to representations derived from the protein language model ESM-2 in order to generate amino acid sequences. We quantitatively investigate the impact of the individual components of the latent diffusion model and show how each contributes to generation performance. Extensive evaluations using multiple metrics across two protein modalities show that DiMA achieves better quality, diversity, and distribution matching than leading autoregressive transformer-based and discrete diffusion models while using ten times fewer parameters. Our approach consistently produces novel, diverse protein sequences that reflect the structural and functional diversity of the protein space. Furthermore, we demonstrate the conditional generation capabilities of our method. Our work advances protein design by providing a robust framework for scalable, high-quality protein sequence generation.