DARKIN: A zero-shot classification benchmark and an evaluation of protein language models

Published: 04 Mar 2024, Last Modified: 27 Apr 2024, Venue: MLGenX 2024 Poster, License: CC BY 4.0
Keywords: Protein language models, dark kinases, phosphorylation, zero-shot learning
TL;DR: We present DARKIN, a zero-shot classification benchmark dataset for assigning phosphosites to understudied kinases (dark kinases), and use it to evaluate the representational power of protein language models.
Abstract: Protein language models (pLMs) aim to capture the complex information embedded within protein sequences and are useful for downstream protein prediction tasks. With a plethora of pLMs available, there is now a critical need to benchmark their performance across diverse tasks. Here, we introduce a biologically relevant zero-shot prediction benchmark that focuses on dark kinase-phosphosite associations. Kinases are the enzymes responsible for protein phosphorylation, and they play vital roles in cellular signaling. While phosphoproteomics allows large-scale identification of phosphosites, determining which kinase catalyzes each phosphorylation event remains challenging. We present a zero-shot classification benchmark dataset, DARKIN, for assigning phosphosites to one of the understudied kinases (dark kinases). DARKIN provides train, validation, and test folds that are split based on zero-shot classification, kinase groups, and sequence similarities. Evaluating pLMs with a novel training-free k-NN-based zero-shot classifier and a bilinear zero-shot classifier reveals superior performance by the ESM models, ProtT5-XL, and the recently introduced structure-based SaProt model. We believe this biologically relevant yet challenging benchmark will further facilitate assessing the efficacy of pLMs and aid the exploration of dark kinases.
Submission Number: 50
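
The abstract mentions a training-free k-NN-based zero-shot classifier but does not spell out its details. The sketch below illustrates one plausible reading of that idea and should not be taken as the authors' method: each test phosphosite finds its k most similar training phosphosites, and the seen kinases of those neighbours vote for unseen (dark) kinases in proportion to kinase-embedding similarity. The function name `knn_zero_shot_scores`, the cosine-similarity weighting, and the toy vectors standing in for pLM embeddings are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def knn_zero_shot_scores(test_sites, train_sites, train_labels,
                         seen_kinases, unseen_kinases, k=3):
    """Hypothetical training-free scorer (an illustrative assumption).

    test_sites:     (n_test, d_site)  phosphosite embeddings to classify
    train_sites:    (n_train, d_site) phosphosite embeddings with known kinases
    train_labels:   (n_train,)        integer row indices into seen_kinases
    seen_kinases:   (n_seen, d_kin)   embeddings of kinases seen in training
    unseen_kinases: (n_unseen, d_kin) embeddings of the zero-shot kinases
    Returns an (n_test, n_unseen) score matrix; higher means more likely.
    """
    # Cosine similarity between every test site and every training site.
    site_sim = l2_normalize(test_sites) @ l2_normalize(train_sites).T
    # Cosine similarity between every seen kinase and every unseen kinase.
    kin_sim = l2_normalize(seen_kinases) @ l2_normalize(unseen_kinases).T

    k = min(k, train_sites.shape[0])
    # Indices and similarities of the k nearest training sites per test site.
    nn_idx = np.argsort(-site_sim, axis=1)[:, :k]
    nn_weight = np.take_along_axis(site_sim, nn_idx, axis=1)
    # Seen-kinase label of each neighbour, shape (n_test, k).
    nn_kinase = train_labels[nn_idx]

    # Each neighbour votes for unseen kinases via its own kinase's similarity
    # to them, weighted by how close the neighbour is to the test site.
    return (nn_weight[:, :, None] * kin_sim[nn_kinase]).sum(axis=1)


# Toy data standing in for pLM embeddings (purely illustrative values).
train_sites = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
train_labels = np.array([0, 1])
seen_kinases = np.array([[1.0, 0.0], [0.0, 1.0]])
unseen_kinases = np.array([[0.9, 0.1], [0.1, 0.9]])
test_sites = np.array([[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]])

scores = knn_zero_shot_scores(test_sites, train_sites, train_labels,
                              seen_kinases, unseen_kinases, k=1)
pred = scores.argmax(axis=1)  # index of the predicted unseen kinase per site
```

Because nothing is fitted, the quality of such a classifier depends entirely on the embeddings, which is what makes this kind of setup a direct probe of a pLM's representations.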