Track: Track 1: Original Research/Position/Education/Attention Track
Keywords: DNA, DNA language model, gLM, tokenization, genomic sequence representation
TL;DR: Evolutionary conservation–guided "patch" boundaries focus model capacity on the most functionally important regions, yielding smaller models that nonetheless outperform current state-of-the-art models on existing benchmarks and, uniquely, permit on-the-fly re-patching
Abstract: DNA language models are emerging as powerful tools for representing genomic sequences, with recent progress driven by self-supervised learning. However, performance on downstream tasks is sensitive to the tokenization strategy, reflecting the complex encoding of information in DNA, where both regulatory elements and single-nucleotide changes can be functionally significant. Yet existing models are fixed to their initial tokenization strategy: single-nucleotide encodings produce long sequences that challenge transformer architectures, while fixed multi-nucleotide schemes such as byte pair encoding struggle with character-level modelling. We propose a biologically informed alternative to tokenization that uses evolutionary conservation scores to guide 'patch' boundaries, drawing inspiration from the Byte Latent Transformer's grouping of bytes into patches. By prioritizing conserved regions, our approach directs computational resources to the most functionally relevant parts of the DNA sequence. We show that models up to an order of magnitude smaller surpass current state-of-the-art performance on existing DNA benchmarks. Importantly, our approach provides the flexibility to change the patching without retraining, which previous methods do not offer, while also improving downstream performance.
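The patching idea described in the abstract can be illustrated with a minimal sketch. Note this is an illustrative assumption, not the paper's actual algorithm: the function name, the simple threshold rule, and the `max_patch_len` cap are all hypothetical. The sketch mirrors the Byte Latent Transformer's threshold-style patching, but driven by per-base conservation scores (e.g., phyloP or phastCons values): a new patch begins whenever conservation crosses a threshold, so conserved bases fall into short patches (receiving more model capacity) while low-conservation runs are grouped into longer patches.

```python
def conservation_patch_boundaries(scores, threshold=0.5, max_patch_len=16):
    """Hypothetical greedy patcher: split a sequence of per-base
    conservation scores into (start, end) index patches.

    A new patch starts whenever a base's score reaches `threshold`
    (conserved bases get fine-grained patches) or when the current
    patch reaches `max_patch_len` (caps patch length in long
    low-conservation runs)."""
    patches = []
    start = 0
    for i in range(1, len(scores)):
        if scores[i] >= threshold or (i - start) >= max_patch_len:
            patches.append((start, i))
            start = i
    patches.append((start, len(scores)))  # close the final patch
    return patches


# Two conserved bases (0.9, 0.95) each open their own short patch,
# while the flanking low-conservation bases are grouped together.
example = conservation_patch_boundaries([0.1, 0.1, 0.9, 0.95, 0.1, 0.1])
```

Because the boundaries depend only on the score track and the threshold, re-patching a sequence is a cheap re-run of this function rather than a retraining step, which is consistent with the on-the-fly re-patching flexibility the abstract claims.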
Submission Number: 266