ProteinRAP: Constructing Retrieval Augmented Prompts to Assist Large Language Models in Protein Understanding
Abstract: Large language models (LLMs) have demonstrated remarkable success in Natural Language Processing (NLP), primarily due to emergent abilities derived from extensive pre-training. Pre-trained LLMs can handle numerous tasks without additional supervised fine-tuning, making them easy to transfer to new problems. However, when applied to the "language of life", protein sequences, LLMs often fail to capture the complex relationships between amino acid sequences and their functions, resulting in suboptimal performance on related tasks. To address this issue, this study introduces **ProteinRAP**, a novel method that leverages Retrieval-Augmented Prompts (RAPs) to enhance LLM performance on protein tasks without extensive retraining. ProteinRAP comprises Protein-Text CLIP, which uses contrastive learning for cross-modal retrieval, and an optimized prompt-learning strategy. With RAP construction, LLMs exhibit significant improvements in protein understanding. Evaluations of both general and protein-specific LLMs on protein understanding tasks highlight the limitations of existing methods; ProteinRAP markedly boosts performance, achieving up to 87.7% improvement over general LLMs and matching state-of-the-art results without additional training.
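The Protein-Text CLIP component described above can be illustrated with a minimal NumPy sketch of a CLIP-style setup: a symmetric InfoNCE objective over matched protein-text embedding pairs, and cosine-similarity ranking for cross-modal retrieval. The function names and the random embeddings below are illustrative assumptions, not the paper's actual encoders or training code.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit sphere so dot products are cosines."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def clip_style_retrieval(protein_emb, text_emb, k=2):
    """Rank text entries for each protein query by cosine similarity
    in a shared embedding space (CLIP-style cross-modal retrieval)."""
    sim = l2_normalize(protein_emb) @ l2_normalize(text_emb).T
    return np.argsort(-sim, axis=1)[:, :k]  # indices of top-k texts per protein

def symmetric_infonce(protein_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over matched pairs: row i of each matrix is a
    positive protein-text pair; all other rows serve as in-batch negatives."""
    logits = (l2_normalize(protein_emb) @ l2_normalize(text_emb).T) / temperature
    n = logits.shape[0]
    diag = np.arange(n)
    # log-softmax along both axes; positives sit on the diagonal
    log_p2t = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_t2p = logits - np.log(np.exp(logits).sum(axis=0, keepdims=True))
    return -(log_p2t[diag, diag].mean() + log_t2p[diag, diag].mean()) / 2
```

A trained model would replace the raw arrays with encoder outputs; retrieval then fills the prompt with the top-ranked text descriptions for a query protein.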
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: Protein language model; Protein understanding; AI for science
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-compute settings (efficiency)
Languages Studied: English
Submission Number: 3607