Keywords: Genomic foundation models, Learnable tokenization, Efficient sequence modeling, Hierarchical architectures
TL;DR: We propose LDARNet, a hierarchical genomic foundation model that adapts H-Net’s dynamic tokenization to the masked language modeling paradigm.
Abstract: Genomic foundation models increasingly adopt large language model architectures, yet almost all rely on fixed tokenization schemes such as $k$-mers or byte-pair encoding. These approaches impose arbitrary sequence boundaries and risk discarding biologically relevant signals. Recent work on H-Net introduced dynamic hierarchical tokenization in an autoregressive setup, demonstrating the feasibility of adaptive tokenization on the genome but leaving downstream evaluation unexplored. We present \textbf{LDARNet}, a hierarchical genomic foundation model that adapts H-Net to the masked language modeling (MLM) paradigm. LDARNet combines BiMamba-2 outer layers operating at nucleotide resolution with a Transformer backbone in a compressed latent space, and uses a ratio regularizer to enforce stable learnable token boundaries. Pretrained on human and multispecies genomes, LDARNet is evaluated under a frozen-embedding protocol with logistic regression probes across 26 tasks from the Genomics Benchmarks and Nucleotide Transformer suites. Despite the absence of task-specific finetuning, LDARNet achieves performance competitive with state-of-the-art Transformer baselines and sets new state-of-the-art results on multiple histone modification tasks. These findings provide the first evidence that adaptive tokenization under MLM training yields biologically meaningful embeddings, and highlight hierarchical compression as a promising direction for scalable and interpretable genomic modeling.
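The frozen-embedding evaluation mentioned in the abstract can be illustrated with a short sketch. The embedding arrays below are random stand-ins for LDARNet outputs (the encoder itself is not shown), and scikit-learn's LogisticRegression serves as the linear probe; this is a minimal sketch of the general protocol under those assumptions, not the authors' exact pipeline or metrics.

```python
# Sketch of a frozen-embedding linear probe (assumed protocol):
# 1) encode each sequence with the frozen pretrained model (no gradient updates),
# 2) fit a logistic-regression probe on the pooled embeddings,
# 3) score on held-out data. Embeddings here are random placeholders; only the
#    probing step reflects the protocol described in the abstract.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def probe_frozen_embeddings(train_emb, train_labels, test_emb, test_labels):
    """Fit a logistic-regression probe on frozen embeddings and return accuracy."""
    clf = LogisticRegression(max_iter=1000)      # linear probe; encoder stays frozen
    clf.fit(train_emb, train_labels)
    preds = clf.predict(test_emb)
    return accuracy_score(test_labels, preds)

# Stand-in embeddings (random vectors in place of LDARNet sequence embeddings).
rng = np.random.default_rng(0)
train_emb, test_emb = rng.normal(size=(512, 256)), rng.normal(size=(128, 256))
train_labels, test_labels = rng.integers(0, 2, 512), rng.integers(0, 2, 128)
print(probe_frozen_embeddings(train_emb, train_labels, test_emb, test_labels))
```

The reported metric would depend on the benchmark task; accuracy is used here purely for illustration.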
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Submission Number: 20856