Abstract: The exponential growth of available protein sequences has made the pretrain-then-finetune paradigm dominant in protein function prediction. However, finetuning a pretrained protein language model (PLM) for diverse downstream tasks requires annotated protein data tailored to each task. To avoid such redundant individual finetuning, we propose a methodology that unifies various Protein function prediction tasks via Text Matching (named ProTeM). This method first transforms the simple numeric or categorical labels of disparate protein datasets into semantically rich textual descriptions. We then harness a pretrained large language model, proficient in comprehensive language understanding, to capture the intrinsic interconnections among varied protein functions and to align text with protein representations. During inference, we predict protein functionality via the text-matching paradigm. Extensive experiments demonstrate that ProTeM performs on par with individually finetuned models and outperforms models based on conventional multi-task learning. Moreover, ProTeM exhibits an enhanced capacity for protein representation, surpassing state-of-the-art PLMs.
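The text-matching inference described above can be sketched minimally: a protein embedding is scored against text embeddings of candidate function descriptions, and the best match is the predicted function. This is a hypothetical illustration with toy vectors; in ProTeM the embeddings would come from the protein encoder and the language model, and the candidate descriptions from the transformed labels.

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def predict_function(protein_emb, candidates):
    # Return the textual description whose embedding best matches the protein.
    return max(candidates, key=lambda desc: cosine(protein_emb, candidates[desc]))

# Toy candidate function descriptions with made-up text embeddings.
candidates = {
    "catalyzes ATP hydrolysis": [0.9, 0.1, 0.0],
    "binds DNA at promoter regions": [0.1, 0.8, 0.3],
}
protein_emb = [0.85, 0.2, 0.05]  # made-up protein embedding
print(predict_function(protein_emb, candidates))
```

Because every task's labels are expressed as text, one matching procedure covers all tasks; adding a new task only requires new candidate descriptions, not a new finetuned head.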