Keywords: Protein Language Model, Retrieval-augmentation
TL;DR: AIDO.RAGPLM and AIDO.RAGFold enable fast and accurate protein structure prediction, outperforming AlphaFold2 in low-MSA settings and reducing reliance on deep alignments.
Abstract: Advances in artificial intelligence have significantly accelerated progress in protein structure prediction, with AlphaFold2 setting a new benchmark for prediction accuracy by using its Evoformer module to automatically extract co-evolutionary information from multiple sequence alignments (MSAs). To address AlphaFold2’s dependence on MSA depth and quality, we propose two novel models, AIDO.RAGPLM and AIDO.RAGFold: pre-trained modules for Retrieval-AuGmented protein language modeling and structure prediction in an AI-driven Digital Organism. AIDO.RAGPLM integrates pre-trained protein language models with retrieved MSAs, surpassing single-sequence protein language models in perplexity, contact prediction, and fitness prediction. When sufficient MSA data are available, AIDO.RAGFold achieves TM-scores comparable to AlphaFold2 while running up to eight times faster, and it significantly outperforms AlphaFold2 when MSA data are insufficient (∆TM-score = 0.379, 0.116, and 0.059 with 0, 5, and 10 MSA sequences as input, respectively). Additionally, we developed an MSA retriever based on hierarchical ID generation that is 45 to 90 times faster than traditional methods, expanding the MSA training set for AIDO.RAGPLM by 32%. Our findings suggest that AIDO.RAGPLM provides an efficient and accurate solution for protein structure prediction, particularly in scenarios with limited MSA data. The AIDO.RAGPLM model has been open-sourced and is available at https://huggingface.co/genbio-ai/AIDO.Protein-RAG-3B.
Submission Number: 28