Protein2Text: Providing Rich Descriptions from Protein Sequences

Published: 31 Jul 2025, Last Modified: 16 Aug 2025, LM4Sci, CC BY 4.0
Keywords: Protein2Text, Biological Large Models, Protein Language Models, Natural Language Processing, Protein Description Prediction, Deep Learning, Protein Annotations, Generative AI
TL;DR: We present BetaDescribe, a LLAMA2-based model fine-tuned on biological and natural language data to generate detailed protein descriptions directly from amino acid sequences.
Abstract: Understanding protein function has long been a focal point of biological research because proteins play critical roles in many biological processes. The task is challenging: proteins are complex, and uncovering their specific functions requires sophisticated experimental designs and extended timelines. In this work, we introduce BetaDescribe, a collection of models designed to generate detailed and rich textual descriptions of proteins, encompassing properties such as function, catalytic activity, involvement in specific metabolic pathways, subcellular localization, and the presence of specific domains. The trained BetaDescribe model receives a protein sequence as input and outputs a textual description of these properties. The model was trained on datasets containing both biological and English text, which allowed it to incorporate biological knowledge. We demonstrate the utility of BetaDescribe by providing descriptions for proteins that share little to no sequence similarity with proteins that have functional descriptions in public datasets. Using in-silico mutagenesis, we show that BetaDescribe relies on functionally important regions when making its predictions, suggesting that the model identifies regions important for protein function without requiring homologous sequences. BetaDescribe offers a powerful tool for exploring protein function, augmenting existing approaches such as annotation transfer based on sequence or structure similarity.
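The abstract does not specify how the in-silico mutagenesis was carried out. As a minimal sketch of the general idea only, one common approach is a sliding-window scan: mask each window of the sequence, re-score the mutant, and record how much the score drops. Everything here is an assumption for illustration: `window_scan`, the alanine mask, the window size, and `toy_score`, a hypothetical stand-in for a real scorer that would compare the model's description of the mutant against its description of the original sequence.

```python
from typing import Callable, List, Tuple


def window_scan(
    sequence: str,
    score: Callable[[str], float],
    window: int = 3,
    mask: str = "A",
) -> List[Tuple[int, float]]:
    """Mask each window with `mask` residues and return (start, score drop) pairs."""
    baseline = score(sequence)
    drops = []
    for start in range(len(sequence) - window + 1):
        # Build the mutant by overwriting one window with the mask residue.
        mutant = sequence[:start] + mask * window + sequence[start + window:]
        drops.append((start, baseline - score(mutant)))
    return drops


def toy_score(seq: str) -> float:
    # Hypothetical scorer: 1.0 if an illustrative motif is intact, else 0.0.
    # A real scorer would query the description model instead.
    return 1.0 if "CGHC" in seq else 0.0


seq = "MKTLLVAAACGHCKALLPEW"  # made-up sequence, not from the paper
drops = window_scan(seq, toy_score, window=3)

# Window starts whose masking lowers the score overlap the motif.
important = [start for start, drop in drops if drop > 0]
print(important)
```

Windows with a large score drop mark regions the scorer depends on. With a real model, the drop would be a similarity measure between generated descriptions rather than a binary motif check.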
Submission Number: 15