Retrieval-Augmented Language Model for Knowledge-aware Protein Encoding

Jiasheng Zhang; Delvin Ce Zhang; Shuang Liang; Zhengpin Li; Rex Ying; Jie Shao

Published: 01 May 2025, Last Modified: 23 Jul 2025, Venue: ICML 2025 poster, License: CC BY 4.0

TL;DR: We propose a knowledge-aware retrieval-augmented protein language model, achieving the first unified and direct integration of protein knowledge graphs and protein language models. Performance on 6 downstream tasks verifies its superiority.

Abstract: Protein language models often struggle to capture biological functions due to their lack of factual knowledge (e.g., gene descriptions). Existing solutions leverage protein knowledge graphs (PKGs) as auxiliary pre-training objectives, but lack explicit integration of task-oriented knowledge, so they suffer from limited knowledge exploitation and catastrophic forgetting. The root cause is that they fail to align PKGs with task-specific data, forcing their knowledge modeling to adapt to the knowledge-isolated nature of downstream tasks. In this paper, we propose the Knowledge-aware retrieval-augmented protein language model (Kara), achieving the first task-oriented and explicit integration of PKGs and protein language models. Using a knowledge retriever that learns to predict linkages between the PKG and task proteins, Kara unifies the knowledge integration of the pre-training and fine-tuning stages with a structure-based regularization, mitigating catastrophic forgetting. To ensure task-oriented integration, Kara uses contextualized virtual tokens to extract graph context as task-specific knowledge for new proteins. Experiments show that Kara outperforms existing knowledge-enhanced models on 6 representative tasks, with an average improvement of 5.1%.

Lay Summary: This paper introduces Kara, a new method for understanding proteins using artificial intelligence. Kara combines two components: protein language models and protein knowledge graphs. Existing methods connect these two components poorly. Kara fixes this by using a special "knowledge retriever" that finds the right information in the knowledge graph and adds it to the language model. This helps the model understand proteins better, especially their functions and how they interact with other proteins.

Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.

Primary Area: Applications->Chemistry, Physics, and Earth Sciences

Keywords: Knowledge Graphs, Protein Science, Representation Learning

Submission Number: 8244
