LDARNet: DNA Adaptive Representation Network with Learnable Tokenization for Genomic Modeling

19 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Genomic foundation models, Learnable tokenization, Efficient sequence modeling, Hierarchical architectures
TL;DR: We propose LDARNet, a hierarchical genomic foundation model that adapts H-Net’s dynamic tokenization to the masked language modeling paradigm.
Abstract: Genomic foundation models increasingly adopt large language model architectures, yet almost all rely on fixed tokenization schemes such as $k$-mers or BPE. These approaches impose arbitrary sequence boundaries and risk discarding biologically relevant signals. Recent work introduced dynamic hierarchical tokenization in an autoregressive setup, demonstrating the feasibility of adaptive tokenization but leaving masked language modeling and downstream evaluation unexplored. We present \textbf{LDARNet}, a 120M-parameter hierarchical genomic foundation model that adapts hierarchical compression to the masked language modeling paradigm. LDARNet combines BiMamba-2 state-space layers with selective attention and uses ratio-based regularization to learn stable token boundaries without supervised segmentation. We evaluate LDARNet through comprehensive fine-tuning on 27 diverse tasks from the Genomics Benchmarks and Nucleotide Transformer suites, comparing against state-of-the-art models spanning 8M to 2.5B parameters. LDARNet achieves 11 of 18 wins among compact models ($<$300M parameters), a 5.5-fold improvement over the next-best alternative, and establishes overall best performance on 5 challenging histone modification tasks, surpassing even 2.5B-parameter competitors. Notably, LDARNet wins 7 of 10 histone modification benchmarks, suggesting that learnable compression boundaries effectively capture the long-range dependencies critical for modeling epigenetic regulation. These findings provide evidence that adaptive tokenization under masked language modeling yields biologically meaningful representations and highlight hierarchical compression as a promising direction for efficient, scalable genomic foundation models.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Submission Number: 20856