Archiving Submission: Yes (archival)
Keywords: Motif Preservation, GeneticBPE, Conserved Regions, Biological Sequence Modeling
TL;DR: To prevent fragmentation of biological signals, GeneticBPE modifies BPE to preserve crucial motifs during tokenization, boosting miRNA model performance.
Abstract: Tokenization plays a foundational yet underexplored role in biological sequence modeling. In this work, we present **GeneticBPE**, a biologically informed tokenization framework that encodes prior structural knowledge, such as seed motifs and conserved regions, into the vocabulary construction process. Unlike standard subword methods that optimize purely for frequency or language-model likelihood, GeneticBPE integrates motif preservation objectives and generalization-aware constraints into a modified merge scoring scheme. We evaluate our method on binary and multiclass miRNA classification tasks using the MirGeneDB v3.0 dataset and show that GeneticBPE outperforms character-level, k-mer, Unigram, and BPE tokenizations in accuracy, cross-species generalization, and motif fidelity. Theoretical results demonstrate that tokenization directly governs the inductive bias and domain robustness of sequence models. Our findings suggest that tokenization should not be treated as a preprocessing utility, but rather as a design-critical component in biological NLP pipelines.
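To make the "modified merge scoring scheme" mentioned above concrete, the sketch below shows one possible way a motif preservation objective could enter BPE merge selection. This is not the paper's exact formulation: the penalty weight `lambda_motif`, the function names, and the per-occurrence scoring rule are all illustrative assumptions.

```python
"""Minimal illustrative sketch of motif-aware BPE merge scoring (not the paper's exact
formulation). Assumed scheme: a merge's score is its corpus frequency, with each
occurrence whose merged token would straddle a known motif boundary penalised by
a hypothetical weight `lambda_motif`."""

from collections import Counter
from typing import Dict, List, Tuple

Symbols = Tuple[str, ...]  # a sequence split into current subword symbols


def motif_spans(sequence: str, motifs: List[str]) -> List[Tuple[int, int]]:
    """Character spans of every motif occurrence in a sequence."""
    spans = []
    for m in motifs:
        start = sequence.find(m)
        while start != -1:
            spans.append((start, start + len(m)))
            start = sequence.find(m, start + 1)
    return spans


def straddles(token_span: Tuple[int, int], motif_span: Tuple[int, int]) -> bool:
    """True if the token span only partially overlaps the motif, i.e. fragments it."""
    s, e = token_span
    ms, me = motif_span
    overlap = min(e, me) - max(s, ms)
    inside_motif = ms <= s and e <= me      # token fully inside the motif: fine
    contains_motif = s <= ms and me <= e    # token fully contains the motif: fine
    return overlap > 0 and not (inside_motif or contains_motif)


def score_merges(corpus: Dict[Symbols, int], motifs: List[str],
                 lambda_motif: float = 5.0) -> Counter:
    """Score candidate merges: +1 per occurrence, minus lambda_motif for each
    occurrence whose merged token would straddle a conserved motif."""
    scores: Counter = Counter()
    for symbols, count in corpus.items():
        seq = "".join(symbols)
        spans = motif_spans(seq, motifs)
        pos = 0
        for a, b in zip(symbols, symbols[1:]):
            token_span = (pos, pos + len(a) + len(b))
            bad = any(straddles(token_span, ms) for ms in spans)
            scores[(a, b)] += count * (1.0 - (lambda_motif if bad else 0.0))
            pos += len(a)
    return scores


if __name__ == "__main__":
    # Toy miRNA-like corpus: sequences pre-split into single-nucleotide symbols.
    corpus = {tuple("UGAGGUAGUAGGUUGUAUAGUU"): 3,
              tuple("ACUGGCCUGUACAAAGUGCUUA"): 2}
    seed_motifs = ["GAGGUAG"]  # illustrative let-7-style seed motif
    scores = score_merges(corpus, seed_motifs)
    best = max(scores, key=scores.get)
    print("best merge:", best, "score:", scores[best])
```

Under this assumed scoring rule, merges that would cut across a conserved seed region accumulate negative score and are deferred, so the motif tends to surface intact as its own token; the released code and motif files should be consulted for the actual objective.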
Reproducibility: Code, motif files, and the pretrained tokenizer will be released under the MIT license upon acceptance.
Submission Number: 36