DNA Compression with Genomic Language Models: Tokenization, Benchmarking, and an Information-Content Map

Vojtěch Máčala; Petr Simecek

Published: 28 May 2026, Last Modified: 09 Jun 2026

Venue: GenBio 2026 Poster

License: CC BY 4.0

Keywords: genomic language models, DNA compression, arithmetic coding, genomic foundation models, tokenization, information content, genome browser, lossless compression

TL;DR: Pairing genomic LMs with arithmetic coding turns log-likelihood into a unit-free benchmark and a per-nucleotide information-content map of the human genome.

Abstract: Lossless compression and probabilistic sequence modeling are two sides of the same coin: a model that assigns high probability to a sequence can encode it in few bits via arithmetic coding. We exploit this duality to evaluate genomic language models as compressors of DNA, using compression primarily as an objective probe of generative sequence modeling rather than as a deployable storage system. We release DNAGPT2, a family of ten GPT-2-small models that differ only in byte-pair encoding (BPE) vocabulary size, pretrained for one epoch on a single A40 GPU using the DNABERT2 multi-species corpus. Coupled with arithmetic coding, the best model reaches 1.47 bits per base on the T2T human genome, placing fourth in the Cobilab compression benchmark and ahead of every general-purpose compressor. Our results suggest that NLP-style tokenization choices may be suboptimal for DNA: a 32-token BPE vocabulary compresses better than larger vocabularies. We also find that, in this benchmark, published long-context genomic LMs underperform a much shorter-context BPE GPT-2; this is not a controlled context-length ablation, since the compared models also differ in architecture, training data, parameter count, and tokenization. Finally, we compute a per-nucleotide information-content map of the human genome and show that exons, introns, intergenic regions, and Alu repeats have statistically distinct information profiles.
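The link between log-likelihood and compressed size can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `bits_per_base` and the toy inputs are hypothetical, and it computes only the ideal code length (the negative log2-probability of the sequence), which an arithmetic coder approaches up to a small overhead. Dividing by the number of nucleotides rather than the number of tokens is what makes models with different vocabularies comparable.

```python
import math


def bits_per_base(token_log_probs, n_bases):
    """Ideal code length in bits per nucleotide.

    token_log_probs: natural-log probabilities a model assigned to each
        token of a sequence (hypothetical input for illustration).
    n_bases: number of nucleotides those tokens cover.
    """
    if n_bases <= 0:
        raise ValueError("n_bases must be positive")
    # Convert the summed negative log-likelihood from nats to bits.
    total_bits = -sum(token_log_probs) / math.log(2)
    return total_bits / n_bases


# Baseline: a uniform model over A, C, G, T with one token per base.
uniform_log_probs = [math.log(0.25)] * 1000
baseline_bpb = bits_per_base(uniform_log_probs, n_bases=1000)

# Toy multi-base tokens (as with BPE): 3 tokens covering 8 bases,
# costing 4 + 6 + 2 = 12 bits in total.
toy_log_probs = [math.log(1 / 16), math.log(1 / 64), math.log(1 / 4)]
toy_bpb = bits_per_base(toy_log_probs, n_bases=8)

print(f"uniform baseline: {baseline_bpb:.2f} bits per base")
print(f"toy BPE example:  {toy_bpb:.2f} bits per base")
```

A model that knows nothing about DNA scores 2 bits per base under this measure, so the reported 1.47 bits per base can be read against that baseline.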

Email Sharing: We authorize the sharing of all author emails with Program Chairs.

Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.

Submission Number: 145
