OPI: An Open Instruction Dataset for Adapting Large Language Models to Protein-Related Tasks

Published: 11 Oct 2024, Last Modified: 12 Nov 2024 · NeurIPS 2024 Workshop FM4Science Poster · CC BY 4.0
Keywords: Instruction Dataset, Large Language Model, Protein modeling, Computational Biology, AI for Life Science
Abstract: Large language models (LLMs) pretrained on extensive general corpora, such as GPT-4 and the Llama series, have shown exceptional performance across a wide range of natural language processing (NLP) tasks. These models provide a user-friendly and efficient interface that aligns well with user preferences through natural language instructions. Despite these advances, the application of LLMs in biomolecular sciences, particularly in protein-related research, remains constrained, with the boundaries of their capabilities yet to be fully explored. To bridge this gap, we present Open Protein Instructions (OPI), a comprehensive dataset containing over 1.64M instruction-tuning samples (98.38% training, 1.62% testing) dedicated to protein research. OPI enables LLMs to perform a broad array of protein-related tasks efficiently and cost-effectively. Experimental evaluations across three task categories—sequence understanding (SU), annotation prediction (AP), and knowledge mining (KM)—demonstrate OPI’s effectiveness in adapting LLMs to protein-specific applications. Our findings support the feasibility of leveraging LLMs for biomolecular research through instruction tuning. Data, code, and instruction-tuned models are publicly available at https://github.com/baaihealth/opi to advance research in this field.
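To make the dataset's shape concrete, here is a minimal sketch of what an instruction-tuning sample for a protein task might look like, together with the split sizes implied by the abstract's figures. The field names and the example sequence are illustrative assumptions, not OPI's actual schema.

```python
# Hypothetical OPI-style instruction-tuning sample; field names are
# assumptions (a common instruction/input/output layout), not the
# dataset's documented schema.
sample = {
    "instruction": "Predict the subcellular localization of the given protein sequence.",
    "input": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",  # toy sequence, not from OPI
    "output": "Cytoplasm",
}

# Approximate split sizes from the abstract: ~1.64M samples,
# 98.38% training / 1.62% testing.
total = 1_640_000
n_train = round(total * 0.9838)
n_test = total - n_train
print(f"train: {n_train}, test: {n_test}")
```

At this scale, roughly 1.61M samples would be used for tuning and about 27K held out for the SU, AP, and KM evaluations described above.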
Submission Number: 9