Pairing interacting protein sequences using masked language modeling

Published: 04 Mar 2024, Last Modified: 07 May 2024 · MLGenX 2024 Spotlight · CC BY 4.0
Keywords: Protein language models, AlphaFold Multimer, Multiple sequence alignments, Protein-protein interactions, Protein complexes, Pairing paralogs
TL;DR: We introduce DiffPALM, a method leveraging masked language modeling to predict interacting protein pairs, outperforming coevolution-based pairing and enhancing three-dimensional structure prediction for protein complexes.
Abstract: Predicting which proteins interact together from amino-acid sequences is an important task. We develop a method to pair interacting protein sequences that leverages the power of protein language models trained on multiple sequence alignments, such as MSA Transformer and the EvoFormer module of AlphaFold. Our method, called DiffPALM, solves this pairing problem by exploiting the ability of MSA Transformer to fill in masked amino acids in multiple sequence alignments using the surrounding context. Relying on MSA Transformer without fine-tuning, DiffPALM outperforms existing coevolution-based pairing methods on difficult benchmarks of shallow multiple sequence alignments extracted from ubiquitous prokaryotic protein datasets. Paired alignments of interacting protein sequences are a crucial ingredient of supervised deep learning methods to predict the three-dimensional structure of protein complexes. Starting from sequences paired by DiffPALM substantially improves the structure prediction of some eukaryotic protein complexes by AlphaFold-Multimer.
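
To make the core idea concrete, the sketch below (not the authors' implementation) shows how MSA Transformer's masked-token predictions can be turned into a score for one candidate pairing: residues in a paired alignment are masked at random, and the model's mean log-probability of the true residues serves as evidence that the pairing is consistent with the MSA context. It assumes the fair-esm package; the function `score_pairing` and its arguments are hypothetical illustrations, not part of DiffPALM's released code.

```python
# Minimal sketch: score a candidate pairing of two protein families by the
# masked-language-modeling log-likelihood under MSA Transformer.
import torch
import esm

model, alphabet = esm.pretrained.esm_msa1b_t12_100M_UR50S()
model.eval()
batch_converter = alphabet.get_batch_converter()


def score_pairing(paired_msa, mask_fraction=0.15):
    """Mean log-probability of randomly masked residues in a paired MSA.

    `paired_msa` is a list of (label, sequence) tuples, where each sequence
    concatenates one candidate pair of interaction partners. A higher score
    means the MSA context explains the masked residues better, which a
    DiffPALM-style objective treats as evidence for a correct pairing.
    """
    _, _, tokens = batch_converter([paired_msa])  # (1, n_seqs, seq_len)
    target = tokens.clone()

    # Randomly mask a fraction of positions, skipping the BOS column and padding.
    mask = torch.rand(tokens.shape) < mask_fraction
    mask[..., 0] = False
    mask &= target != alphabet.padding_idx
    tokens[mask] = alphabet.mask_idx

    with torch.no_grad():
        logits = model(tokens)["logits"]  # (1, n_seqs, seq_len, vocab)

    log_probs = torch.log_softmax(logits, dim=-1)
    true_log_probs = log_probs[mask].gather(-1, target[mask].unsqueeze(-1))
    return true_log_probs.mean().item()
```

In the actual method this kind of score is made differentiable with respect to the pairing, so that the assignment of paralogs between the two families can be optimized rather than evaluated one candidate at a time.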
Submission Number: 23