Enhancing Gene Cluster Identification and Classification in Bacterial Genomes Through Synonym Replacement and Deep Learning

Published: 01 Jan 2024, Last Modified: 13 May 2025. Venue: BIBM 2024. License: CC BY-SA 4.0
Abstract: Natural products with diverse biological activities are synthesized by bacteria during secondary metabolism, yielding a wide range of antibiotics, anticancer drugs, and other potential therapeutics. The biosynthesis of these compounds is directed by biosynthetic gene clusters (BGCs), which encode the enzymes and regulatory elements required for their production. With the rapid advancement of sequencing technology, the number of fully sequenced bacterial genomes has increased exponentially, giving researchers an unprecedented opportunity to discover more BGCs. However, current identification methods are significantly constrained by their reliance on the amount of data in existing databases, and they predominantly use local features of the BGC sequences. In this study, we introduce Biosynthetic Gene Clusters Catcher: GCN-BERT (BGCCGB), a deep learning model that leverages natural language processing to identify BGCs in bacterial genomes and to classify BGC product types. Our approach represents BGCs as sequences of protein family (Pfam) identifiers and augments existing BGC data with a synonym replacement method. Local and global features of the BGC sequences are then integrated by BERT and its self-attention mechanism, in conjunction with Graph Convolutional Networks (GCN). The experimental results indicate that BGCCGB effectively learns the features of BGCs, localizes BGCs in bacterial genomes, and predicts the product types of BGCs. Notably, BGCCGB demonstrates a significant performance improvement over previous methods, with average precision in BGC identification improved by 6.9%. These enhancements underscore the potential of BGCCGB in advancing the discovery and understanding of natural products synthesized by bacteria.
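To make the augmentation idea concrete, the following is a minimal sketch of synonym replacement applied to a BGC written as a sequence of Pfam identifiers. It is not the authors' implementation: the abstract does not say how synonymous Pfam families are chosen, so the `SYNONYMS` table, the `synonym_replace` function, the replacement probability `p`, and the example groupings are all hypothetical and exist only for illustration.

```python
import random

def synonym_replace(pfam_seq, synonyms, p=0.1, rng=None):
    """Return a copy of pfam_seq where each token that has synonyms
    is swapped for a randomly chosen synonym with probability p.

    Hypothetical helper: the paper's actual replacement rule is not
    given in the abstract.
    """
    rng = rng or random.Random()
    augmented = []
    for token in pfam_seq:
        candidates = synonyms.get(token, [])
        if candidates and rng.random() < p:
            augmented.append(rng.choice(candidates))
        else:
            augmented.append(token)  # keep tokens without synonyms unchanged
    return augmented

# Assumed synonym table: these groupings are invented for illustration
# and do not reflect any curated relationship between Pfam families.
SYNONYMS = {
    "PF00109": ["PF02801"],
    "PF00550": ["PF14573"],
}

# A toy BGC expressed as an ordered list of Pfam identifiers.
bgc = ["PF00109", "PF02801", "PF00698", "PF00550"]

# Generate one augmented variant with a fixed seed for repeatability.
augmented = synonym_replace(bgc, SYNONYMS, p=0.5, rng=random.Random(0))
print(augmented)
```

Each augmented sequence keeps the length and order of the original cluster, so it can be fed to the same tokenizer and model as the unmodified data. In practice the synonym table would have to come from some notion of functional similarity between families, which this sketch leaves open.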