Keywords: Multi-modal LLM, Bioinformatics, Modality Alignment, AI for Science, Protein, Large Language Models, LLM
TL;DR: We propose ProteinGPT, a novel multimodal large language model designed specifically for protein analysis, which enables users to upload protein sequences and/or structures for in-depth analysis and dynamic responses.
Abstract: Analyzing protein structures and functions is central to understanding biological processes, developing drugs, and advancing biotechnology, yet it is often complex and time-consuming with traditional research methods. To address this challenge, we present ProteinGPT, a multimodal large language model designed specifically for proteins. ProteinGPT lets users upload protein sequences and/or structures and ask questions about them, returning in-depth analysis and dynamic responses. The model integrates protein sequence and structure encoders with linear projection layers to ensure accurate and adaptive representations, and a large language model (LLM) then generates precise and contextually relevant answers. For training, we curated a large-scale dataset of 132,092 proteins, each annotated with 20-30 property tags and 5-10 QA pairs, and optimized the instruction-tuning process using GPT-4o. Experiments show that ProteinGPT generates informative and relevant responses to protein-related questions, outperforming baseline models and general-purpose LLMs on both semantic and lexical metrics. Our code and data are anonymously available at https://github.com/ProteinGPT/ProteinGPT.
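The abstract describes sequence and structure encoders whose outputs pass through linear projection layers before reaching the LLM. The following is a minimal, dependency-free sketch of that idea, not the authors' implementation. All names and dimensions (`make_linear`, `build_llm_inputs`, `SEQ_DIM`, and so on) are hypothetical, the random vectors stand in for real encoder outputs, and prepending the projected protein tokens to the text tokens is an assumption, since the abstract does not say how the modalities are combined.

```python
import random

def make_linear(in_dim, out_dim, rng):
    """Create a linear layer as (weights, bias); random values stand in for trained ones."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias

def apply_linear(layer, vectors):
    """Map each input vector to the layer's output dimension: y = W x + b."""
    weights, bias = layer
    return [
        [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weights, bias)]
        for vec in vectors
    ]

def build_llm_inputs(seq_emb, struct_emb, text_emb, seq_proj, struct_proj):
    """Project both protein modalities into the LLM embedding space and
    prepend them to the text token embeddings (an assumed combination scheme)."""
    return apply_linear(seq_proj, seq_emb) + apply_linear(struct_proj, struct_emb) + text_emb

rng = random.Random(0)

# Hypothetical dimensions, chosen small for illustration only.
SEQ_DIM, STRUCT_DIM, LLM_DIM = 8, 6, 4
NUM_RESIDUES, NUM_TEXT_TOKENS = 5, 3

# Placeholder embeddings standing in for the sequence encoder, structure encoder,
# and the LLM's own token embeddings of the user's question.
seq_emb = [[rng.random() for _ in range(SEQ_DIM)] for _ in range(NUM_RESIDUES)]
struct_emb = [[rng.random() for _ in range(STRUCT_DIM)] for _ in range(NUM_RESIDUES)]
text_emb = [[rng.random() for _ in range(LLM_DIM)] for _ in range(NUM_TEXT_TOKENS)]

seq_proj = make_linear(SEQ_DIM, LLM_DIM, rng)
struct_proj = make_linear(STRUCT_DIM, LLM_DIM, rng)

llm_inputs = build_llm_inputs(seq_emb, struct_emb, text_emb, seq_proj, struct_proj)
print(len(llm_inputs), len(llm_inputs[0]))
```

Each modality gets its own projection because the two encoders may emit vectors of different widths; after projection every token shares the LLM's embedding width and can be placed in a single input sequence.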
Submission Number: 19