ProteinRAP: Constructing Retrieval Augmented Prompts to Assist Large Language Models in Protein Understanding
Abstract: Large language models (LLMs) have demonstrated remarkable success in Natural Language Processing (NLP), primarily due to emergent abilities derived from extensive pre-training. Pre-trained LLMs can handle numerous tasks without additional supervised fine-tuning, making them easy to transfer to new problems. However, when applied to the "language of life", protein sequences, LLMs often fail to capture the complex relationships between amino acid sequences and their functions, resulting in suboptimal performance on related tasks. To address this issue, this study introduces **ProteinRAP**, a novel method that leverages Retrieval-Augmented Prompts (RAPs) to enhance LLM performance on protein tasks without extensive retraining. ProteinRAP comprises Protein-Text CLIP, which uses contrastive learning for cross-modal retrieval, and an optimized prompt-learning strategy. With RAP construction, LLMs exhibit significant improvements in protein understanding. Evaluations of both general and protein-specific LLMs on protein understanding tasks highlight the limitations of existing methods; ProteinRAP markedly boosts performance, achieving up to 87.7% improvement over general LLMs and matching state-of-the-art results without additional training.
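The Protein-Text CLIP component described above can be illustrated with a minimal NumPy sketch of a CLIP-style setup: a symmetric InfoNCE objective over matched protein-text embedding pairs, and cosine-similarity ranking for cross-modal retrieval. The function names and the random embeddings below are illustrative assumptions, not the paper's actual encoders or training code.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit sphere so dot products are cosines."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def clip_style_retrieval(protein_emb, text_emb, k=2):
    """Rank text entries for each protein query by cosine similarity
    in a shared embedding space (CLIP-style cross-modal retrieval)."""
    sim = l2_normalize(protein_emb) @ l2_normalize(text_emb).T
    return np.argsort(-sim, axis=1)[:, :k]  # indices of top-k texts per protein

def symmetric_infonce(protein_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over matched pairs: row i of each matrix is a
    positive protein-text pair; all other rows serve as in-batch negatives."""
    logits = (l2_normalize(protein_emb) @ l2_normalize(text_emb).T) / temperature
    n = logits.shape[0]
    diag = np.arange(n)
    # log-softmax along both axes; positives sit on the diagonal
    log_p2t = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_t2p = logits - np.log(np.exp(logits).sum(axis=0, keepdims=True))
    return -(log_p2t[diag, diag].mean() + log_t2p[diag, diag].mean()) / 2
```

A trained model would replace the raw arrays with encoder outputs; retrieval then fills the prompt with the top-ranked text descriptions for a query protein.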
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: Protein language model; Protein understanding; AI for science
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-compute settings (efficiency)
Languages Studied: English
Submission Number: 3607