Keywords: Genomic Language Model; Linear Attention
TL;DR: We present HGDNA, a hybrid Gated DeltaNet-based genomic language model with unified CLM training formulation and overlapping k-mer tokenization.
Abstract: The domain of genomic language models (gLMs) has advanced rapidly, with models pretrained on diverse multi-species genomic corpora demonstrating remarkable capabilities. While the benefit of the simplest nucleotide-level tokenization for long-context modeling has been established, overlapping k-mer tokenization, which provides richer neighborhood information, has been neglected in existing gLM designs. Here, we thoroughly revisit overlapping tokenization and present HGDNA, a hybrid linear-attention gLM trained under a unified causal language modeling (CLM) paradigm across pretraining and fine-tuning, via a species-classification auxiliary task and shared class tokens. HGDNA delivers superior performance across classification, zero-shot embedding, and instruction-based sequence-design tasks, demonstrating robust performance and notable efficiency on both short-range and long-range tasks.
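To make the contrast concrete, here is a minimal sketch of overlapping k-mer tokenization versus nucleotide-level tokenization. This is an illustrative toy, not the paper's actual tokenizer; the function name `kmer_tokenize` and the choice of k=3 are assumptions for the example.

```python
def kmer_tokenize(seq: str, k: int = 3, stride: int = 1) -> list[str]:
    """Slide a window of length k over the sequence with the given stride.

    stride=1 yields overlapping k-mers (each token shares k-1 bases with
    its neighbor, carrying local context); stride=k yields non-overlapping
    k-mers; k=1 degenerates to nucleotide-level tokenization.
    """
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

# Overlapping 3-mers: consecutive tokens share two nucleotides.
print(kmer_tokenize("ATGCA", k=3))          # ['ATG', 'TGC', 'GCA']
# Nucleotide-level tokenization as the k=1 special case.
print(kmer_tokenize("ATGCA", k=1))          # ['A', 'T', 'G', 'C', 'A']
```

Overlap trades a longer token sequence for richer per-token neighborhood information, which is the design axis the abstract revisits.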
Supplementary Material: zip
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Submission Number: 11916