Keywords: Protein Foundation Model, Sparse Experts Model, Protein Property Prediction, Protein Generation
Abstract: Proteins play a fundamental role in life. Understanding the language of proteins offers significant potential for gaining mechanistic insights into biological systems and opens new avenues for treating diseases, enhancing agriculture, and safeguarding the environment. While large protein language models (PLMs) such as ESM2-15B and xTrimoPGLM-100B have achieved remarkable performance on diverse protein understanding and design tasks, these dense transformer models are computationally inefficient to train and deploy. In this work, we introduce AIDO.Protein, a pretrained module for protein representation in an AI-driven Digital Organism [1]. AIDO.Protein is also the first mixture-of-experts (MoE) model in the protein domain, with a model size of 16 billion parameters. Leveraging a sparse MoE architecture with 8 experts in each transformer block and selectively activating 2 experts for each input token, our model is significantly more efficient in training and inference. Through pretraining on 1.2 trillion amino acids collected from UniRef90 and ColabFoldDB, our model achieves state-of-the-art results across most tasks in the xTrimoPGLM benchmark. Furthermore, on over 280 ProteinGym Deep Mutational Scanning (DMS) assays, our model achieves nearly 99% of the overall performance of the best MSA-based model and significantly outperforms previously reported state-of-the-art models that do not use MSA. We also adapted this model for structure-conditioned protein sequence generation and achieved a new state of the art in this domain. These results indicate that AIDO.Protein can serve as a strong foundation model for protein understanding and design. Models and code are available through ModelGenerator at https://github.com/genbio-ai/AIDO and on Hugging Face.
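The abstract's efficiency claim rests on the sparse MoE design: of the 8 expert feed-forward networks in each transformer block, only the 2 selected by the router run for any given token. The sketch below illustrates that top-2 routing pattern in PyTorch; it is a minimal illustration under assumed defaults (module names, hidden sizes, softmax gating, and renormalized top-2 weights are assumptions for exposition), not the released AIDO.Protein implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoE(nn.Module):
    """Sparse mixture-of-experts FFN: 8 experts, 2 activated per token (illustrative)."""

    def __init__(self, d_model=1024, d_ff=4096, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, num_experts, bias=False)  # router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):
        # x: (num_tokens, d_model), tokens already flattened over batch and sequence
        scores = F.softmax(self.gate(x), dim=-1)           # (tokens, num_experts)
        top_w, top_idx = scores.topk(self.top_k, dim=-1)   # keep the 2 best experts per token
        top_w = top_w / top_w.sum(dim=-1, keepdim=True)    # renormalize the 2 gate weights
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = (top_idx == e)                          # tokens routed to expert e
            token_ids, slot = mask.nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            # weight each expert output by its (renormalized) gate score
            out[token_ids] += top_w[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
        return out

# Usage: y = Top2MoE()(torch.randn(16, 1024))  # only 2 of 8 experts compute per token
```

Because each token touches only 2 of the 8 experts, the per-token compute is roughly a quarter of a dense model with the same total FFN parameter count, which is the efficiency advantage the abstract cites for training and inference.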
Submission Number: 124