Keywords: multiple sequence alignments, discrete flow matching, latent diffusion, protein design
Abstract: Multiple Sequence Alignments (MSAs) provide fundamental information about protein evolution and play crucial roles in downstream applications, such as structure prediction and family-based design. However, constructing high-quality MSAs requires significant computational resources to query natural protein databases, and traditional techniques fail to retrieve sufficient data for proteins with limited homology. While recent generative models have been proposed for MSA augmentation, they often struggle to capture complex, high-order dependencies in sequence distributions while maintaining permutation invariance. To address these challenges, we introduce MSAFlow, a framework built on two key innovations. First, its core is a novel generative autoencoder that pairs a compressed AlphaFold3 (AF3) MSA representation with a conditional Statistical Flow Matching (SFM) decoder to faithfully model a family's sequence distribution while preserving permutation invariance. Second, we introduce a latent flow-matching model that performs zero-shot generation of MSA embeddings from a single sequence, enabling powerful augmentation for orphan proteins. By integrating these components, MSAFlow operates as a unified framework for MSA representation, augmentation, and family-based design. Our experiments demonstrate that MSAFlow significantly outperforms existing models on family-based protein design and MSA augmentation tasks, especially for low-homology proteins. MSAFlow is lightweight, fast, and memory-efficient, offering a single, versatile solution for diverse protein engineering tasks.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Submission Number: 20688