Deep Learning versus Motif-Based Models for Promoter Sequence Recognition

16 Sept 2025 (modified: 06 Dec 2025) | Agents4Science 2025 Conference Desk Rejected Submission | CC BY 4.0
Keywords: Promoter recognition, DNA sequence classification, Motif-based models, Deep learning, Computational biology, Machine learning in genomics, TATA box
TL;DR: This paper shows that a simple motif-based model slightly outperforms a neural network in recognizing synthetic promoter DNA sequences, highlighting that classical methods can rival deep learning on simple tasks.
Abstract: We compare a feedforward neural network with a classical motif-based classifier for identifying promoter DNA sequences. We generate a synthetic dataset of 100-bp sequences, half containing an embedded promoter motif (“TATAAT”) and half without it. The motif-based baseline is a logistic regression trained on 3-mer counts; the neural network is trained on one-hot encoded sequences. The motif-based classifier achieves accuracy 0.765 ± 0.009 and AUC 0.845 ± 0.008, slightly outperforming the neural network (accuracy 0.746 ± 0.012, AUC 0.825 ± 0.014). We release code and data generation scripts for full reproducibility.
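The data setup described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' released code: the function names (`make_dataset`, `kmer_counts`, `one_hot`), the uniform base distribution, the random motif position, and the rejection of chance motif occurrences in negatives are all assumptions. It covers only sequence generation and the two feature encodings (3-mer counts for the logistic regression, one-hot for the neural network), and it is not expected to reproduce the reported accuracy or AUC.

```python
import random
from itertools import product

BASES = "ACGT"
MOTIF = "TATAAT"   # promoter motif named in the abstract
SEQ_LEN = 100      # sequence length in bp, as stated in the abstract

# All 64 possible 3-mers in a fixed order, used as feature columns.
KMERS = ["".join(p) for p in product(BASES, repeat=3)]
KMER_INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def random_sequence(rng):
    """Draw a uniform random DNA sequence (assumed background model)."""
    return "".join(rng.choice(BASES) for _ in range(SEQ_LEN))


def make_example(rng, positive):
    """Return one sequence, with or without the embedded motif."""
    seq = random_sequence(rng)
    if positive:
        # Overwrite a random window with the motif (assumed placement).
        start = rng.randrange(SEQ_LEN - len(MOTIF) + 1)
        return seq[:start] + MOTIF + seq[start + len(MOTIF):]
    # Assumption: negatives are resampled until the motif is absent.
    while MOTIF in seq:
        seq = random_sequence(rng)
    return seq


def make_dataset(n, seed=0):
    """Build n (sequence, label) pairs, half positive and half negative."""
    rng = random.Random(seed)
    data = []
    for i in range(n):
        positive = i % 2 == 0
        data.append((make_example(rng, positive), int(positive)))
    rng.shuffle(data)
    return data


def kmer_counts(seq):
    """Count overlapping 3-mers; returns a length-64 feature vector."""
    counts = [0] * len(KMERS)
    for i in range(len(seq) - 2):
        counts[KMER_INDEX[seq[i:i + 3]]] += 1
    return counts


def one_hot(seq):
    """Encode a sequence as a list of 4-element one-hot rows (A, C, G, T)."""
    return [[1 if base == b else 0 for b in BASES] for base in seq]


if __name__ == "__main__":
    dataset = make_dataset(10, seed=42)
    seq, label = dataset[0]
    print(label, seq)
    print(len(kmer_counts(seq)), len(one_hot(seq)))
```

The 3-mer representation discards position, so a 6-bp motif is visible to the linear model only through elevated counts of its constituent 3-mers (TAT, ATA, TAA, AAT), whereas the one-hot encoding keeps position but gives a feedforward network no built-in translation invariance.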
Supplementary Material: pdf
Submission Number: 253