Keywords: genomic language models, tokenization, interpretability, genome regulation, DNA
TL;DR: We introduce a biologically grounded tokenization scheme that segments DNA into meaningful “words” based on transcription factor motifs, preserving genomic language model performance while potentially improving interpretability and computational efficiency.
Abstract: Genomic language models achieve strong performance on biological tasks but rely on tokenization methods that overlook the complexity of the genome. We introduce a biologically grounded tokenization strategy that partitions DNA sequences into meaningful “words” based on transcription factor (TF) motifs. Embedding biological insight into vocabulary design preserves predictive power while potentially improving interpretability and computational efficiency. Proof-of-concept results demonstrate that motif-informed tokens yield representations that better capture the language of gene regulation, opening the door to models that are both highly predictive and capable of decoding the regulatory genomic grammar vital for drug discovery and precision medicine.
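To make the idea of motif-based “words” concrete, here is a minimal sketch of one way such a tokenizer could work. It is not the submission's method: the greedy longest-match rule, the single-nucleotide fallback, the `tokenize` function, and the `MOTIFS` table are all assumptions made for illustration. Real TF motifs are usually described by position weight matrices rather than exact strings, so the three consensus sequences below are a deliberate simplification.

```python
# Hypothetical motif vocabulary: exact consensus strings used for illustration only.
# A real vocabulary would likely come from a motif database and allow degenerate matches.
MOTIFS = {
    "TATAAA": "TATA_box",
    "CACGTG": "E_box",
    "CCAAT": "CCAAT_box",
}


def tokenize(seq, motifs=MOTIFS):
    """Segment a DNA string left to right, preferring the longest matching motif.

    Positions not covered by any motif fall back to single-nucleotide tokens
    (an assumption; a k-mer or BPE fallback would also be possible).
    """
    seq = seq.upper()
    # Try longer motifs first so that overlapping candidates resolve to the longest.
    lengths = sorted({len(m) for m in motifs}, reverse=True)
    tokens = []
    i = 0
    while i < len(seq):
        for k in lengths:
            piece = seq[i:i + k]
            if piece in motifs:
                tokens.append(piece)
                i += k
                break
        else:
            # No motif starts here: emit one base and move on.
            tokens.append(seq[i])
            i += 1
    return tokens


if __name__ == "__main__":
    example = "GGTATAAACACGTGA"
    print(tokenize(example))
    # ['G', 'G', 'TATAAA', 'CACGTG', 'A']
```

In this toy case the 15-base sequence becomes 5 tokens, and two of them map directly to named motifs. This illustrates how a motif-informed vocabulary could both shorten the input and make individual tokens easier to interpret.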
Submission Number: 48