PubLabeler: Enhancing Automatic Classification of Publications in UniProtKB Using Protein Textual Description and PubMedBERT

Published: 01 Jan 2025, Last Modified: 12 Aug 2025
Venue: IEEE J. Biomed. Health Informatics 2025
License: CC BY-SA 4.0
Abstract: In UniProtKB, each protein is linked to numerous publications covering topics such as sequence, function, and structure, which are annotated manually or through automated methods. Given the vast number of proteins and publications, manual annotation is time-consuming and labour-intensive. Although UniProtKB offers automated annotations, their quality often falls short. Developing an accurate automated classifier to identify the topics of the publications associated with each protein is therefore imperative for advancing biomedical knowledge discovery. The classification operates on protein-publication pairs and is complicated by three properties of the data: it is multi-label, labels co-occur, and classes are imbalanced. This paper proposes a novel method called PubLabeler, which takes both the protein's textual description and the scientific literature text as input. PubLabeler employs the PubMedBERT model to encode the input texts and integrates label co-occurrence information into the model parameters. Additionally, it uses focal loss to update parameters, allowing the model to focus more on classes with few instances. Using literature newly annotated in Swiss-Prot in 2023 as a test set, PubLabeler achieved superior results in both micro and macro metrics, showing a 28.5% improvement in macro-F1 compared to UniProtKB's automated annotation method, UPCLASS. Furthermore, we validated PubLabeler's effectiveness in TrEMBL annotation, where it produced more comprehensive predictions than TrEMBL's automated annotations. These findings highlight PubLabeler's reliability and potential to advance protein-related information extraction and knowledge discovery.
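To illustrate why focal loss shifts attention toward rare topics, the following is a minimal sketch of the standard binary focal loss applied independently to each label of one protein-publication pair. It is not PubLabeler's actual implementation: the function name `multilabel_focal_loss`, the default `gamma=2.0`, and the omission of any class-weighting term are assumptions made for illustration, since the abstract does not give the exact formulation.

```python
import math


def multilabel_focal_loss(probs, targets, gamma=2.0, eps=1e-12):
    """Mean binary focal loss over the labels of one example.

    probs   -- predicted probability per label (e.g. after a sigmoid)
    targets -- 0/1 ground-truth indicator per label
    gamma   -- focusing parameter (assumed value; gamma=0 gives plain
               binary cross-entropy)
    """
    total = 0.0
    for p, y in zip(probs, targets):
        # Probability the model assigned to the correct outcome for this label.
        p_t = p if y == 1 else 1.0 - p
        p_t = min(max(p_t, eps), 1.0)  # guard against log(0)
        # Confident, correct predictions are down-weighted by (1 - p_t)^gamma.
        total += -((1.0 - p_t) ** gamma) * math.log(p_t)
    return total / len(probs)


# Hypothetical three-topic example: sequence, function, structure.
easy = multilabel_focal_loss([0.9, 0.1, 0.1], [1, 0, 0])
hard = multilabel_focal_loss([0.9, 0.1, 0.1], [1, 0, 1])
```

In this sketch, `hard` exceeds `easy` because the third label is a positive that the model scored at only 0.1, so it keeps nearly its full cross-entropy weight, while the well-classified labels contribute very little. Under class imbalance, rare labels tend to be the poorly classified ones, which is how this loss concentrates updates on them.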