PatchDNA: A Flexible and Biologically-Informed Alternative to Tokenization for DNA
Keywords: DNA, DNA language model, gLM, tokenization, genomic sequence representation
TL;DR: Evolutionary conservation-guided “patch” boundaries focus model capacity on the most functionally important regions, yielding smaller models that nonetheless outperform the current state of the art on benchmarks and, uniquely, permit on-the-fly re-patching.
Submission Number: 7