Keywords: state space model, foundation model, protein design, protein fitness prediction
Abstract: Protein design has important implications for drug discovery, personalized medicine, and biotechnology. Models based on multiple sequence alignments efficiently capture the evolutionary information in homologous protein sequences, but multiple sequence alignment construction is imperfect. We present ProtMamba, a homology-aware but alignment-free protein language model based on the Mamba architecture. In contrast with attention-based models, ProtMamba efficiently handles very long context, comprising hundreds of protein sequences. Trained on a large dataset of concatenated homologous sequences, ProtMamba combines autoregressive and masked language modeling through a fill-in-the-middle objective. We demonstrate ProtMamba’s usefulness for the generation of novel sequences and for fitness prediction. ProtMamba reaches competitive performance with other protein language models despite its smaller size, which sheds light on the importance of long-context conditioning.
Submission Number: 46
Loading