Keywords: Promoter recognition, DNA sequence classification, Motif-based models, Deep learning, Computational biology, Machine learning in genomics, TATA box
TL;DR: This paper shows that a simple motif-based model slightly outperforms a neural network in recognizing synthetic promoter DNA sequences, highlighting that classical methods can rival deep learning on simple tasks.
Abstract: We compare a feedforward neural network with a classical motif-based classifier for identifying promoter DNA sequences. We generate a synthetic dataset of 100-bp sequences, half containing an embedded promoter motif (“TATAAT”) and half without. The motif-based baseline is a logistic regression trained on 3-mer counts; the neural network is trained on one-hot encoded sequences. The motif-based classifier achieves accuracy 0.765 ± 0.009 and AUC 0.845 ± 0.008, slightly outperforming the neural network (accuracy 0.746 ± 0.012, AUC 0.825 ± 0.014). We release code and data generation scripts for full reproducibility.
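The data generation and 3-mer featurization described in the abstract could look roughly like the sketch below. It is an illustration, not the released code: the function names, the seed handling, the uniform base frequencies, the random motif position, and the choice to resample negatives that happen to contain the motif are all assumptions. The classifiers themselves are omitted.

```python
import random
from collections import Counter
from itertools import product

MOTIF = "TATAAT"   # embedded promoter motif named in the abstract
SEQ_LEN = 100      # sequence length in bp, as stated in the abstract
ALPHABET = "ACGT"
# All 64 possible 3-mers, in a fixed order, used as feature columns.
KMERS = ["".join(p) for p in product(ALPHABET, repeat=3)]


def random_sequence(rng, length=SEQ_LEN):
    """Draw a sequence with uniform base frequencies (an assumption)."""
    return "".join(rng.choice(ALPHABET) for _ in range(length))


def make_example(rng, positive):
    """Return one sequence; positives get the motif at a random offset."""
    if positive:
        seq = random_sequence(rng)
        start = rng.randrange(SEQ_LEN - len(MOTIF) + 1)
        return seq[:start] + MOTIF + seq[start + len(MOTIF):]
    # Assumption: negatives containing the motif by chance are resampled.
    while True:
        seq = random_sequence(rng)
        if MOTIF not in seq:
            return seq


def make_dataset(n, seed=0):
    """Build n sequences with alternating labels (0 = negative, 1 = positive)."""
    rng = random.Random(seed)
    labels = [i % 2 for i in range(n)]
    seqs = [make_example(rng, bool(y)) for y in labels]
    return seqs, labels


def kmer_counts(seq):
    """Count overlapping 3-mers and return them as a 64-dimensional vector."""
    counts = Counter(seq[i:i + 3] for i in range(len(seq) - 2))
    return [counts[kmer] for kmer in KMERS]


if __name__ == "__main__":
    seqs, labels = make_dataset(1000, seed=0)
    features = [kmer_counts(s) for s in seqs]
    print(len(features), len(features[0]), sum(labels))
```

The resulting feature vectors could then be passed to any logistic regression implementation, while the same sequences would be one-hot encoded for the neural network.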
Supplementary Material: pdf
Submission Number: 253