CoLD: A Co-evolutionary Latent Diffusion Model for MSA Generation

ICLR 2026 Conference Submission13959 Authors

18 Sept 2025 (modified: 08 Oct 2025) | ICLR 2026 Conference Submission | CC BY 4.0
Keywords: Multiple Sequence Alignment Generation, Diffusion Models, Protein Co-evolution
TL;DR: CoLD introduces diffusion-based MSA generation in continuous protein embedding space, enabling zero-shot homolog synthesis that substantially improves structure prediction for orphan proteins through controllable evolutionary modeling.
Abstract: Protein structure prediction relies critically on Multiple Sequence Alignments (MSAs) that capture co-evolutionary information from homologous proteins. However, orphan proteins lacking sufficient homologs present a fundamental challenge, as sparse or absent MSAs severely limit folding accuracy. Current MSA generation methods operate through discrete token-based autoregressive generation, failing to capture the continuous nature of evolutionary relationships and global co-evolutionary constraints inherent in natural protein families. We introduce CoLD (Co-evolutionary Latent Diffusion), which reformulates MSA generation as conditional diffusion in the continuous embedding space of pretrained protein language models. By modeling evolution as smooth manifold trajectories and co-evolution through joint probability distributions over entire alignment embeddings, CoLD enables controllable homolog generation with biologically interpretable evolutionary distance control. Our two-stage training paradigm first establishes reliable embedding-to-sequence mappings, then optimizes diffusion with progressive biological constraints including profile consistency, sequence diversity, and amino acid distribution alignment. Extensive evaluation on CASP14/15 benchmarks and challenging zero-shot scenarios demonstrates that CoLD substantially outperforms existing methods, achieving 11+ point improvements in confidence metrics for orphan proteins while maintaining superior conservation pattern preservation (up to 0.994 correlation). These results validate the effectiveness of continuous diffusion modeling for capturing evolutionary relationships in protein sequence generation.
Supplementary Material: zip
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Submission Number: 13959
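Illustrative note on the conservation metric: the abstract reports conservation pattern preservation as a correlation (up to 0.994) but does not define how it is computed. The sketch below shows one plausible reading, not the authors' implementation. It assumes conservation is scored per alignment column as one minus normalized Shannon entropy over a 21-symbol alphabet (20 amino acids plus gap), and that the reported value is a Pearson correlation between the profiles of a reference MSA and a generated MSA of the same length. The function names are hypothetical.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy in bits of one alignment column (gap counted as a symbol)."""
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def conservation_profile(msa):
    """Per-column conservation score: 1.0 means the column is fully conserved.

    Assumption: entropy is normalized by log2(21), i.e. 20 amino acids plus gap.
    """
    max_entropy = math.log2(21)
    return [1.0 - column_entropy(col) / max_entropy for col in zip(*msa)]


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)


def conservation_correlation(reference_msa, generated_msa):
    """Correlate the conservation profiles of two alignments with equal column counts."""
    return pearson(conservation_profile(reference_msa),
                   conservation_profile(generated_msa))


# Toy alignments, invented purely for illustration (not data from the paper).
reference = ["ACDE", "ACDF", "ACGH"]
generated = ["ACDE", "ACEF", "ACGK"]
score = conservation_correlation(reference, generated)
```

Under these assumptions, a score near 1 means the generated alignment is conserved and variable at the same columns as the reference, which is the property the abstract describes as preserved.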