Unsupervised identification of redundant domain entries in InterPro database using clustering techniquesOpen Website

2015 (modified: 17 Apr 2023)BCB 2015Readers: Everyone
Abstract: InterPro is a widely used database that integrates functional signatures provided by different protein sequence annotation databases with manual curation; in order to present a comprehensive database of functional sequence annotation. However, the integration of the signatures causes inconsistent and/or redundant annotations in some cases. In this study, we proposed an unsupervised method for the automatic detection of inconsistent and redundant entries in the InterPro database. Two clustering methods: Markov Cluster Algorithm (MCL) and hierarchical clustering are employed in order to investigate to what extent these signatures can be detected. Results show that a considerable amount of (~75%) redundant entries can be identified. The future goal is to develop a system that does the identification of redundant and inconsistent signatures with very high performance using machine learning techniques in a supervised fashion. The findings of the study may aid InterPro curators to fix the problematic entries. It may also be used by curators as a road map before the integration of new signatures.
0 Replies

Loading