Diffusion on Language Model Encodings for Protein Sequence Generation

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY-NC 4.0
TL;DR: A continuous latent diffusion framework for protein sequence generation with strong performance and versatile conditional generation capabilities.
Abstract: Protein *sequence* design has seen significant advances through discrete diffusion and autoregressive approaches, yet the potential of continuous diffusion remains underexplored. Here, we present *DiMA*, a latent diffusion framework that operates on protein language model representations. Through systematic exploration of architectural choices and diffusion components, we develop a robust methodology that generalizes across multiple protein encoders ranging from 8M to 3B parameters. We demonstrate that our framework achieves consistently high performance across sequence-only (ESM-2, ESMc), dual-decodable (CHEAP), and multimodal (SaProt) representations using the same architecture and training approach. We conduct an extensive evaluation of existing methods alongside *DiMA* using multiple metrics across two protein modalities, covering the quality, diversity, novelty, and distribution matching of generated proteins. *DiMA* consistently produces novel, high-quality, and diverse protein sequences and achieves strong results compared to baselines such as autoregressive, discrete diffusion, and flow matching language models. The model demonstrates versatile functionality, supporting conditional generation tasks including protein family generation, motif scaffolding and infilling, and fold-specific sequence design, despite being trained solely on sequence data. This work provides a universal continuous diffusion framework for protein sequence generation, offering both architectural insights and practical applicability across various protein design scenarios. Code is released at [GitHub](https://github.com/MeshchaninovViacheslav/DiMA).
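To make the core idea concrete, the sketch below illustrates the generic continuous latent diffusion setup the abstract describes: clean per-residue latents (in DiMA, embeddings from a protein language model such as ESM-2) are corrupted by Gaussian noise under a variance-preserving schedule, and a denoiser would be trained to recover the clean signal. This is a minimal, hedged illustration with toy random arrays standing in for encoder outputs, not the authors' implementation; the cosine schedule shown is one standard choice, and DiMA's exact schedule, parameterization, and architecture are detailed in the paper and repository.

```python
import numpy as np

def cosine_alpha_bar(t, s=0.008):
    """Cosine noise schedule; t in [0, 1]. At t=0 signal is ~intact,
    at t=1 the latent is essentially pure noise."""
    return np.cos((t + s) / (1 + s) * np.pi / 2) ** 2

def forward_noise(z0, t, rng):
    """q(z_t | z_0): corrupt clean latents z0 with Gaussian noise.
    z0 would be per-residue embeddings from a protein LM encoder."""
    a_bar = cosine_alpha_bar(t)
    eps = rng.standard_normal(z0.shape)
    zt = np.sqrt(a_bar) * z0 + np.sqrt(1.0 - a_bar) * eps
    return zt, eps

# Toy stand-in for encoder latents: (sequence length, hidden dim).
rng = np.random.default_rng(0)
z0 = rng.standard_normal((16, 32))

zt, eps = forward_noise(z0, t=0.5, rng=rng)

# Training would regress a denoiser's prediction eps_hat(zt, t) onto eps
# with an MSE loss; sampling then iteratively denoises from pure noise,
# and a decoder maps the final latents back to an amino-acid sequence.
```

The key point the abstract makes is that this diffusion process runs in the continuous embedding space of a pretrained encoder rather than over discrete amino-acid tokens, which is why the same training recipe transfers across different encoders (ESM-2, ESMc, CHEAP, SaProt).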
Lay Summary: Proteins are fundamental biological molecules that drive cellular processes and hold tremendous potential for advancing medicine, biotechnology, and sustainable materials. However, designing new proteins remains challenging due to the vast number of possible amino acid sequences and the complex relationship between sequence and function. We present DiMA, an artificial intelligence framework that generates novel protein sequences by learning from existing protein data. Unlike previous methods, DiMA operates efficiently across different protein representations while maintaining both quality and diversity in generated sequences. Our approach uses significantly fewer computational resources than comparable systems while achieving superior performance. Comprehensive testing demonstrates that DiMA produces structurally viable proteins that are genuinely novel rather than variations of known sequences. The system can be directed to generate proteins with specific characteristics, such as particular functional motifs or structural properties. This work addresses a key bottleneck in protein engineering by making high-quality protein design more accessible and computationally efficient. The implications span drug discovery, industrial biotechnology, and fundamental biological research, potentially accelerating the development of therapeutic proteins, sustainable enzymes, and novel biomaterials.
Link To Code: https://github.com/MeshchaninovViacheslav/DiMA
Primary Area: Applications->Chemistry, Physics, and Earth Sciences
Keywords: continuous diffusion, generative protein design, score matching, denoising models, sequence modeling
Submission Number: 12885