MSAFlow: a Unified Approach for MSA Representation, Augmentation, and Family-based Protein Design

Published: 24 Sept 2025, Last Modified: 26 Dec 2025
Venue: NeurIPS2025-AI4Science Poster
License: CC BY 4.0
Track: Track 1: Original Research/Position/Education/Attention Track
Keywords: multiple sequence alignments, discrete flow matching, latent diffusion, protein design
Abstract: Multiple Sequence Alignments (MSAs) provide fundamental information about protein evolutionary trajectories and play crucial roles in downstream tasks such as augmentation and family-based design. However, constructing high-quality MSAs requires significant computational resources to query natural protein databases, and traditional techniques fail to provide relevant data for proteins with limited evolutionary information. While deep learning approaches have shown promise in MSA construction and augmentation, they fail to capture rich distributional information while preserving permutation invariance. MSAFlow addresses these limitations using a Statistical Flow Matching model conditioned on compressed latent MSA representations to generate sequences that would likely belong to the target MSA. This approach captures distributional information while augmenting shallow MSAs and maintaining permutation invariance. Experiments confirm that MSAFlow generates MSAs with performance comparable to traditional methods on family-based design tasks. Despite being lightweight and trained on smaller datasets, the model outperforms existing machine learning augmentation tools while keeping inference time and memory usage low. MSAFlow enables family-based protein design for enzymes and synthetic MSA generation through latent diffusion. Extensive ablation studies validate the effectiveness of model design components. Overall, MSAFlow provides a robust and efficient framework for MSA representation and integration in downstream applications.
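To make the permutation-invariance requirement concrete, the sketch below computes a per-column residue frequency profile for a toy alignment and checks that reordering the sequences leaves it unchanged. This is only an illustration of the property and is not MSAFlow's learned latent representation. The function name `column_profile`, the 21-symbol alphabet, and the toy sequences are all assumptions introduced here.

```python
from collections import Counter

# Assumed alphabet: 20 standard amino acids plus the gap symbol.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"


def column_profile(msa):
    """Return per-column symbol frequencies for an aligned MSA.

    `msa` is a list of equal-length strings. The result is a list with one
    row per alignment column, each row holding frequencies over ALPHABET.
    Because only counts are used, the order of sequences does not matter.
    """
    if not msa:
        raise ValueError("MSA must contain at least one sequence")
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all aligned sequences must have the same length")
    depth = len(msa)
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in msa)
        profile.append([counts.get(sym, 0) / depth for sym in ALPHABET])
    return profile


# Hypothetical toy alignment; real MSAs are far deeper and longer.
toy_msa = ["MKV-A", "MRVLA", "MKILA"]
shuffled = [toy_msa[2], toy_msa[0], toy_msa[1]]

# The profile is identical for any ordering of the same sequences.
is_invariant = column_profile(toy_msa) == column_profile(shuffled)
```

A fixed summary like this loses information about correlations between columns, which is one reason a learned, compressed representation may be preferable.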
Submission Number: 294