Endowing Protein Language Models with Structural Knowledge

23 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: protein representation learning, protein language models, self-supervised learning, graph transformers
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: A framework for endowing pretrained protein language models with structural knowledge
Abstract: Protein language models have shown strong performance in predicting function and structure across diverse tasks. These models undergo unsupervised pretraining on vast sequence databases to generate rich protein representations, followed by finetuning with labeled data on specific downstream tasks. The recent surge in computationally predicted protein structures opens new opportunities in protein representation learning. In our study, we introduce a novel framework that enhances transformer protein language models with protein structural information. Drawing from recent advances in graph transformers, our approach refines the self-attention mechanisms of pretrained language transformers by integrating structural information through structure extractor modules. This refined model, termed the Protein Structure Transformer (PST), is further pretrained on a protein structure database such as AlphaFoldDB, using the same masked language modeling objective as traditional protein language models. Our empirical findings show superior performance on several benchmark datasets. Notably, PST consistently outperforms the foundation model for protein sequences, ESM-2, upon which it is built. Our code and pretrained models will be released upon publication.
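To make the idea in the abstract concrete, below is a minimal sketch (not the authors' released code) of how a structure extractor module could inject information from a residue contact graph into the self-attention of a pretrained sequence transformer before further masked-language-model pretraining. All module names, dimensions, and the simple neighbour-averaging extractor are illustrative assumptions, not the PST implementation.

```python
# Hypothetical sketch: structure-aware self-attention for a protein language model.
import torch
import torch.nn as nn


class StructureExtractor(nn.Module):
    """Toy structure extractor: averages neighbour features over a residue
    contact graph and projects them to the model dimension."""

    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x:   (batch, seq_len, d_model) residue features
        # adj: (batch, seq_len, seq_len) binary contact map from the 3D structure
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1.0)
        neighbour_mean = adj @ x / deg          # mean over structural neighbours
        return self.proj(neighbour_mean)


class StructureAwareSelfAttention(nn.Module):
    """Self-attention whose queries and keys are offset by structure-derived
    features, illustrating how a pretrained attention block might be refined."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.extractor = StructureExtractor(d_model)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        s = self.extractor(x, adj)              # per-residue structural bias
        q = k = x + s                           # inject structure into attention
        out, _ = self.attn(q, k, x)
        return out


if __name__ == "__main__":
    batch, seq_len, d_model = 2, 64, 320        # ESM-2-small-like embedding size
    x = torch.randn(batch, seq_len, d_model)    # stand-in for pretrained embeddings
    adj = (torch.rand(batch, seq_len, seq_len) < 0.05).float()  # random contact map
    layer = StructureAwareSelfAttention(d_model, n_heads=4)
    print(layer(x, adj).shape)                  # torch.Size([2, 64, 320])
```

In this sketch the structural bias is simply added to the queries and keys; the actual PST design, per the abstract, integrates a structure extractor into the pretrained ESM-2 attention and continues masked language modeling on predicted structures from AlphaFoldDB.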
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: zip
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7966