Keywords: Genomic Language Model; Linear Attention
TL;DR: We present HGDNA, a hybrid Gated DeltaNet-based genomic language model with unified CLM training formulation and overlapping k-mer tokenization.
Abstract: The domain of genomic language models (gLMs) has advanced rapidly, with models pretrained on diverse multi-species genomic corpora demonstrating remarkable capabilities. While the benefit of the simplest nucleotide-level tokenization for long-context modeling has been established, overlapping k-mer tokenization, which provides richer neighborhood information, has been neglected in existing gLM designs. Here, we thoroughly revisit overlapping tokenization and present HGDNA, a hybrid linear-attention gLM trained under a unified causal language modeling (CLM) paradigm across pretraining and fine-tuning, via a species-classification auxiliary task and shared class tokens. HGDNA delivers superior performance across classification, zero-shot embedding, and instruction-based sequence-design tasks, demonstrating robust performance and notable efficiency on both short-range and long-range tasks.
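To make the contrast concrete, here is a minimal sketch of overlapping k-mer tokenization versus nucleotide-level tokenization. This is an illustrative toy, not the paper's actual tokenizer; the function name `kmer_tokenize` and the choice of k=3 are assumptions for the example.

```python
def kmer_tokenize(seq: str, k: int = 3, stride: int = 1) -> list[str]:
    """Slide a window of length k over the sequence with the given stride.

    stride=1 yields overlapping k-mers (each token shares k-1 bases with
    its neighbor, carrying local context); stride=k yields non-overlapping
    k-mers; k=1 degenerates to nucleotide-level tokenization.
    """
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

# Overlapping 3-mers: consecutive tokens share two nucleotides.
print(kmer_tokenize("ATGCA", k=3))          # ['ATG', 'TGC', 'GCA']
# Nucleotide-level tokenization as the k=1 special case.
print(kmer_tokenize("ATGCA", k=1))          # ['A', 'T', 'G', 'C', 'A']
```

Overlap trades a longer token sequence for richer per-token neighborhood information, which is the design axis the abstract revisits.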
Supplementary Material: zip
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Submission Number: 11916