DNA Sequence Classification: An Advanced Machine Learning Framework For Accurate Splice Junction Detection

Md Abubakkar

Published: 17 Oct 2025, Last Modified: 29 Oct 2025, Subang Jaya, Malaysia, Everyone, CC BY-NC 4.0

Abstract: In genomic data analysis, DNA splice junction classification is a critical task for understanding gene expression, because splice junctions are the sites where introns are removed and exons are joined. Accurate identification of these junctions is essential for deciphering gene function. Traditional methods, such as sequence alignment, are often slow and computationally intensive, especially when processing large-scale DNA datasets. To address this, we developed and evaluated multiple machine learning (ML) and deep learning (DL) models for splice junction classification. Our goal was to improve classification accuracy, reduce computational cost, and provide a comparative analysis of different modeling approaches. The framework included traditional ML algorithms (Random Forest, Gradient Boosting, Decision Tree, Support Vector Machine (SVM), and XGBoost) as well as DL architectures, namely Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). The preprocessing pipeline used one-hot encoding to represent the sequences as features. Ensemble learning methods performed best, with Gradient Boosting and XGBoost achieving classification accuracies of 97.34% and 97.02%, respectively. Among the DL models, CNNs outperformed RNNs, with 94.51% accuracy compared to 93.89%. These results indicate that tree-based ensemble methods are highly effective for splice junction classification, outperforming the DL models evaluated here.
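The one-hot encoding step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the encoding details, so the function name `one_hot_encode`, the A/C/G/T column order, and the handling of ambiguous symbols as all-zero rows are assumptions made here for illustration only, not the author's implementation.

```python
# Illustrative sketch only: the names, column order, and the treatment of
# ambiguous symbols are assumptions, not details taken from the paper.
NUCLEOTIDES = "ACGT"  # assumed column order


def one_hot_encode(sequence: str) -> list:
    """Flatten a DNA string into a vector with four binary values per base."""
    index = {base: i for i, base in enumerate(NUCLEOTIDES)}
    encoded = []
    for base in sequence.upper():
        row = [0] * len(NUCLEOTIDES)
        # Assumption: symbols outside A/C/G/T (e.g. N) stay as an all-zero row.
        if base in index:
            row[index[base]] = 1
        encoded.extend(row)
    return encoded


if __name__ == "__main__":
    print(one_hot_encode("ACGT"))
```

Under these assumptions, a sequence of length n becomes a vector of 4n binary features, which is a form that both tree-based models and neural networks can consume directly.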