MSA Generation with Seqs2Seqs Pretraining: Advancing Protein Structure Predictions

Le Zhang; Jiayang Chen; Tao Shen; Yu Li; Siqi Sun

Published: 25 Sept 2024, Last Modified: 06 Nov 2024. Venue: NeurIPS 2024 poster. License: CC BY 4.0

Keywords: Protein Language Model

TL;DR: We introduce MSA-Generator, a self-supervised model that generates virtual MSAs, enhancing protein structure predictions in key benchmarks.

Abstract: Deep learning models like AlphaFold2 have revolutionized protein structure prediction, achieving unprecedented accuracy. However, the dependence on robust multiple sequence alignments (MSAs) continues to pose a challenge, especially for proteins that lack a wealth of homologous sequences. To overcome this limitation, we introduce MSA-Generator, a self-supervised generative protein language model. Trained on a sequence-to-sequence task using an automatically constructed dataset, MSA-Generator employs protein-specific attention mechanisms to harness large-scale protein databases, generating virtual MSAs that enrich existing ones and boost prediction accuracy. Our experiments on CASP14 and CASP15 benchmarks reveal significant improvements in LDDT scores, particularly for complex and challenging sequences, enhancing the performance of both AlphaFold2 and RoseTTAFold. The code is released at https://github.com/lezhang7/MSAGen.

Primary Area: Machine learning for other sciences and fields

Submission Number: 12787
