Abstract: Predicting protein-protein interactions (PPIs) is vital for elucidating fundamental biology, designing
peptide therapeutics, and performing high-throughput protein annotation. This is particularly relevant in
the current biotechnology landscape, where the proliferation of protein generative models necessitates a high-throughput, generalized PPI predictor that works regardless of
conventional motifs or known biological functions. Our work addresses this need and provides
strong evidence of the utility and reliability of protein language models (pLMs) in learning the
PPI objective. We demonstrate that, given a sizable, balanced dataset, pLMs achieve
state-of-the-art performance in PPI prediction across diverse proteins. To approximate these
conditions, we implemented a novel synthetic data generation scheme to augment the BIOGRID and Negatome datasets. The augmented datasets were
then used to fine-tune ProtBERT for PPI prediction, yielding a model that we call SYNTERACT
(SYNThetic data-driven protein-protein intERACtion Transformer). Our results are compelling,
demonstrating 92% accuracy on validated positive and negative interacting pairs derived from 50
different organisms, all of which were excluded from the training phase. Beyond these strong
metrics, a secondary analysis revealed that our synthetic negative data successfully mimics
real negative samples, further reinforcing the integrity of synthetic data additions to PPI datasets.
Another notable discovery was the ease with which previously existing PPI datasets could be predicted
from simple features, calling into question whether they can genuinely inform PPI prediction. We find that
the subcellular compartment bias inherent in the compilation of these datasets is learnable by deep
learning methods and demonstrate that our approach does not suffer from this shortcoming.