Pairing interacting protein sequences using masked language modeling

Published: 04 Mar 2024, Last Modified: 29 Apr 2024 · GEM Poster · CC BY 4.0
Track: Machine learning: computational method and/or computational results
Cell: I do not want my work to be considered for Cell Systems
Keywords: Protein language models, AlphaFold Multimer, Multiple sequence alignments, Protein-protein interactions, Protein complexes, Pairing paralogs
TL;DR: We introduce DiffPALM, a method leveraging masked language modeling to predict interacting protein pairs. It outperforms coevolution-based pairing and improves three-dimensional structure prediction of protein complexes.
Abstract: Predicting which proteins interact together from amino-acid sequences is an important task. We develop a method to pair interacting protein sequences that leverages the power of protein language models trained on multiple sequence alignments, such as MSA Transformer and the EvoFormer module of AlphaFold. Our method, DiffPALM, exploits the ability of MSA Transformer to fill in masked amino acids in multiple sequence alignments using the surrounding context. Relying on MSA Transformer without fine-tuning, DiffPALM outperforms existing coevolution-based pairing methods on difficult benchmarks of shallow multiple sequence alignments extracted from ubiquitous prokaryotic protein datasets. Paired alignments of interacting protein sequences are a crucial ingredient of supervised deep learning methods for predicting the three-dimensional structure of protein complexes. Starting from sequences paired by DiffPALM substantially improves the structure prediction of some eukaryotic protein complexes by AlphaFold-Multimer.
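The abstract's core idea, scoring candidate pairings of paralogs by how well MSA Transformer reconstructs masked residues in the paired alignment, can be illustrated with a short sketch. This is not the authors' DiffPALM implementation (which optimizes pairings, rather than merely scoring one); it only shows how a masked-language-modeling loss could be computed for one candidate pairing using the fair-esm package. The function name `mlm_pairing_loss` and the masking fraction are assumptions for illustration.

```python
# Minimal sketch (not DiffPALM itself): score one candidate pairing of two
# paralog families by the masked-language-modeling loss of MSA Transformer
# on the concatenated (paired) MSA. Requires the fair-esm package.

import torch
import esm

model, alphabet = esm.pretrained.esm_msa1b_t12_100M_UR50S()
model.eval()
batch_converter = alphabet.get_batch_converter()


def mlm_pairing_loss(paired_msa, mask_fraction=0.15, seed=0):
    """Average cross-entropy of MSA Transformer on randomly masked positions.

    paired_msa: list of (label, sequence) tuples; each sequence concatenates
    one aligned sequence from family A with its candidate partner from
    family B. A lower loss suggests a more plausible pairing.
    """
    # Tokenize the MSA; tokens have shape (1, depth, length + 1) with a
    # beginning-of-sequence token prepended to every row.
    _, _, tokens = batch_converter([paired_msa])

    torch.manual_seed(seed)
    # Randomly select residue positions to mask (skip the BOS column).
    maskable = torch.zeros_like(tokens, dtype=torch.bool)
    maskable[:, :, 1:] = torch.rand(tokens[:, :, 1:].shape) < mask_fraction
    masked_tokens = tokens.clone()
    masked_tokens[maskable] = alphabet.mask_idx

    with torch.no_grad():
        logits = model(masked_tokens)["logits"]

    # Cross-entropy restricted to the masked positions only.
    loss = torch.nn.functional.cross_entropy(logits[maskable], tokens[maskable])
    return loss.item()
```

In this picture, a pairing that matches true interaction partners should yield a lower masked-token loss than a shuffled pairing, since the model can exploit inter-chain coevolutionary context when filling in the masks.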
Submission Number: 38