PAIR: Boosting the Predictive Power of Protein Representations with a Corpus of Text Annotations

Haonan Duan; Marta Skreta; Leonardo Cotta; Ella Miray Rajaonson; Nikita Dhawan; Alan Aspuru-Guzik; Chris J. Maddison

Published: 17 Jun 2024, Last Modified: 17 Jul 2024, ICML2024-AI4Science Poster, CC BY 4.0

Keywords: protein function predictions, protein language models, multimodal learning

TL;DR: We present a framework that leverages a diverse range of text annotations to improve the function prediction abilities of protein language models.

Abstract: Protein language models trained on raw amino acid sequences have demonstrated impressive success in various protein function prediction tasks. One explanation for this success is that language modeling for amino acid sequences captures the local evolutionary fitness landscape and, therefore, encourages the models to extract rich information about the structure and function of a protein. Yet, detecting distant evolutionary relationships from sequences alone is a challenge. In this work, we conduct a comprehensive study examining the effects of training protein models on nineteen types of expertly curated function annotations in Swiss-Prot. We find that different annotation types have varying effects on the quality of the learned representations, with some even degrading the model's performance. However, by incorporating a carefully selected subset of annotation types, we are able to improve the model's function prediction performance. Notably, unlike existing protein models, our approach either matches or outperforms the widely used bioinformatics tool BLAST in annotating previously uncharacterized proteins.

Submission Number: 186
