DNA Sequence Classification: An Advanced Machine Learning Framework For Accurate Splice Junction Detection
Abstract: In the context of genomic data analysis, DNA
splice junction classification is a critical task for understanding
gene expression, as these junctions are sites where introns
are removed and exons are joined. Accurate identification of
splice junctions is essential for deciphering gene functionality.
Traditional methods, such as sequence alignment, are often slow
and computationally intensive, especially when processing large-scale
DNA datasets. To address this, we developed and evaluated
multiple machine learning (ML) and deep learning (DL) models
for the accurate classification of splice junctions. Our goal
was to enhance classification accuracy, reduce computational
costs, and provide a comparative analysis of different modeling
approaches to advance research in genomic data analysis. We
employed a methodological framework that included traditional
ML algorithms, such as Random Forest, Gradient Boosting,
Decision Tree, Support Vector Machine (SVM), and XGBoost,
as well as contemporary DL architectures like Convolutional
Neural Networks (CNNs) and Recurrent Neural Networks
(RNNs). The data preprocessing pipeline used one-hot
encoding to represent each nucleotide as a binary indicator
vector. Empirical results showed that ensemble learning
methods performed best, with Gradient Boosting and XGBoost
achieving classification accuracies of 97.34% and 97.02%,
respectively. Among DL models, CNNs outperformed RNNs,
achieving 94.51% accuracy compared to 93.89%. These
results indicate that tree-based ensemble methods are well
suited to splice junction classification, outperforming the DL
models evaluated here by roughly 2.5 to 3.5 percentage points.
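
As an illustration of the one-hot encoding step mentioned above, the following is a minimal sketch and not the exact preprocessing code used in this work. The function name `one_hot_encode`, the A/C/G/T column order, and the choice to map any other symbol (for example an ambiguous `N`) to an all-zero vector are assumptions made for this example.

```python
# Minimal sketch of one-hot encoding for a DNA sequence.
# Assumptions (not taken from the paper): column order A, C, G, T;
# any other symbol (e.g. an ambiguous "N") becomes an all-zero vector.

NUCLEOTIDES = "ACGT"


def one_hot_encode(sequence):
    """Return a flat list of 0/1 values, four per nucleotide."""
    index = {base: i for i, base in enumerate(NUCLEOTIDES)}
    encoded = []
    for base in sequence.upper():
        row = [0] * len(NUCLEOTIDES)
        if base in index:
            row[index[base]] = 1  # mark the column for this base
        encoded.extend(row)
    return encoded


features = one_hot_encode("GTAN")
print(features)
```

A sequence of length n yields a feature vector of length 4n, which can be passed directly to tree-based classifiers or reshaped to n x 4 for convolutional and recurrent models.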