DNA promoter task-oriented dictionary mining and prediction model based on natural language technology
Abstract: Promoters are essential DNA sequences that initiate transcription and regulate gene expression.
Precisely identifying promoter sites is crucial for deciphering gene expression patterns and the roles
of gene regulatory networks. Recent advancements in bioinformatics have leveraged deep learning
and natural language processing (NLP) to enhance promoter prediction accuracy. Techniques such
as convolutional neural networks (CNNs), long short-term memory (LSTM) networks, and BERT
models have been particularly impactful. However, current approaches often rely on arbitrary DNA
sequence segmentation during BERT pre-training, which may not yield optimal results. To overcome
this limitation, this article introduces a novel DNA sequence segmentation method. This approach
develops a more refined dictionary for DNA sequences, utilizes it for BERT pre-training, and employs
an Inception neural network as the foundational model. This BERT-Inception architecture captures
information across multiple granularities. Experimental results show that the model improves
performance on several downstream tasks and provides deep-learning interpretability, offering new
perspectives for interpreting and understanding DNA sequence information. The detailed source code
is available at