ProteinGPT: Multimodal LLM for Protein Property Prediction and Structure Understanding

ICML 2025 Workshop FM4LS Submission 19 Authors

Published: 12 Jul 2025, Last Modified: 12 Jul 2025, Venue: FM4LS 2025, License: CC BY 4.0
Keywords: Multi-modal LLM, Bioinformatics, Modality Alignment, AI for Science, Protein, Large Language Models, LLM
TL;DR: We propose ProteinGPT, a novel multimodal large language model designed specifically for protein analysis, which enables users to upload protein sequences and/or structures for in-depth analysis and dynamic responses.
Abstract: Understanding biological processes and advancing drug development and biotechnology require thorough analysis of protein structures and functions, which is often complex and time-consuming with traditional research methods. To address this challenge, we present ProteinGPT, a multimodal large language model designed specifically for proteins. ProteinGPT enables users to upload protein sequences and/or structures for in-depth analysis and answers their questions interactively. The model integrates protein sequence and structure encoders with linear projection layers to produce accurate and adaptive representations, and uses a large language model (LLM) to generate precise and contextually relevant answers. For training, we curated a large-scale dataset of 132,092 proteins, each annotated with 20-30 property tags and 5-10 QA pairs, and optimized the instruction-tuning process using GPT-4o. Experiments show that ProteinGPT generates informative and relevant responses to protein-related questions, outperforming baseline models and general-purpose LLMs on both semantic and lexical metrics. Our code and data are anonymously available at https://github.com/ProteinGPT/ProteinGPT.
Submission Number: 19
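
Illustrative sketch (not the authors' implementation): the abstract describes sequence and structure encoders whose outputs pass through linear projection layers before reaching the LLM. The pure-Python example below shows one plausible reading of that idea, in which each modality's per-residue embeddings are mapped by a bias-free linear layer into the LLM's embedding dimension and the results are concatenated as prefix tokens. All names (make_linear, project, build_prefix), the dimensions, the random weights, and the choice to concatenate are assumptions made for illustration; the paper may fuse the modalities differently.

```python
import random
from typing import List

Vector = List[float]
Matrix = List[Vector]


def make_linear(in_dim: int, out_dim: int, rng: random.Random) -> Matrix:
    """Random (out_dim x in_dim) weights standing in for a trained projection layer."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def project(tokens: Matrix, weight: Matrix) -> Matrix:
    """Apply y = W x to every token embedding (no bias term, for brevity)."""
    return [[sum(w * x for w, x in zip(row, tok)) for row in weight] for tok in tokens]


def build_prefix(seq_emb: Matrix, struct_emb: Matrix,
                 w_seq: Matrix, w_struct: Matrix) -> Matrix:
    """Project both modalities into the LLM embedding space and concatenate them.

    Concatenation is an assumption of this sketch, not a detail from the abstract.
    """
    return project(seq_emb, w_seq) + project(struct_emb, w_struct)


# Hypothetical toy dimensions; real encoders and LLMs are far larger.
SEQ_DIM, STRUCT_DIM, LLM_DIM, NUM_RESIDUES = 6, 4, 8, 5

rng = random.Random(0)
# Stand-ins for the outputs of a sequence encoder and a structure encoder.
seq_emb = [[rng.random() for _ in range(SEQ_DIM)] for _ in range(NUM_RESIDUES)]
struct_emb = [[rng.random() for _ in range(STRUCT_DIM)] for _ in range(NUM_RESIDUES)]

w_seq = make_linear(SEQ_DIM, LLM_DIM, rng)
w_struct = make_linear(STRUCT_DIM, LLM_DIM, rng)

prefix = build_prefix(seq_emb, struct_emb, w_seq, w_struct)
print(len(prefix), len(prefix[0]))  # 10 tokens, each of width LLM_DIM = 8
```

In a setup like this, the projected prefix tokens would be placed ahead of the embedded text of the user's question, so the LLM attends to the protein representation while generating its answer.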