When repeats drive the vocabulary: a Byte-Pair Encoding analysis of T2T primate genomes

Marina Popova; Iaroslav Chelombitko; Aleksey Komissarov

Published: 05 Mar 2025, Last Modified: 23 Apr 2025, MLGenX 2025, CC BY 4.0

Track: Main track (up to 8 pages)

Abstract: The emergence of telomere-to-telomere (T2T) genome assemblies has opened new avenues for comparative genomics, yet effective tokenization strategies for genomic sequences remain underexplored. In this pilot study, we apply Byte-Pair Encoding (BPE) to nine T2T primate genomes, including three human assemblies, by training an independent BPE tokenizer on each genome with a fixed vocabulary of 512,000 tokens using our custom tool, dnaBPE. Our analysis reveals that only 11,569 tokens are shared across all assemblies, while 991,854 tokens are unique to a single genome, indicating that the shared vocabulary shrinks rapidly as more assemblies are compared. Moreover, phylogenetic trees derived from token overlap failed to recapitulate established primate relationships, a discrepancy attributed to the disproportionate influence of species-specific high-copy repetitive elements. These findings underscore the dual nature of BPE tokenization: while it effectively compresses repetitive sequences, its sensitivity to high-copy elements limits its utility as a universal tool for comparative genomics. We discuss potential hybrid strategies and repeat-masking approaches to refine genomic tokenization, emphasizing the need for domain-specific adaptations in the development of large-scale genomic language models. The dnaBPE tool used in this study is open-source and available at https://github.com/aglabx/dnaBPE.
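
The overlap statistics in the abstract (tokens shared by every assembly, tokens unique to one assembly) can be illustrated with a small sketch. It assumes each tokenizer's vocabulary is available as a set of token strings. The function names, the toy vocabularies, and the use of Jaccard distance as the overlap measure for tree building are assumptions made here for illustration; they are not taken from dnaBPE or from the study.

```python
from collections import Counter
from itertools import combinations


def overlap_summary(vocabs):
    """Return (tokens present in every vocabulary, tokens present in exactly one)."""
    # Count in how many vocabularies each token occurs.
    counts = Counter(tok for vocab in vocabs.values() for tok in set(vocab))
    n = len(vocabs)
    shared = {tok for tok, c in counts.items() if c == n}
    unique = {tok for tok, c in counts.items() if c == 1}
    return shared, unique


def jaccard_distances(vocabs):
    """Pairwise Jaccard distance, 1 - |A & B| / |A | B| (assumed overlap measure).

    Vocabularies are assumed to be non-empty.
    """
    dist = {}
    for a, b in combinations(sorted(vocabs), 2):
        va, vb = set(vocabs[a]), set(vocabs[b])
        dist[(a, b)] = 1.0 - len(va & vb) / len(va | vb)
    return dist


# Toy vocabularies (hypothetical, not data from the study).
toy = {
    "genome_a": {"A", "C", "G", "T", "AT", "ATAT", "GGC"},
    "genome_b": {"A", "C", "G", "T", "AT", "CAG", "GGC"},
    "genome_c": {"A", "C", "G", "T", "AT", "TTAGGG"},
}
shared, unique = overlap_summary(toy)
distances = jaccard_distances(toy)
```

In this toy case the single bases and "AT" are shared by all three vocabularies, while the long repeat-like tokens each appear in only one. A distance matrix of this kind could be passed to a standard tree-building method, which is one way that high-copy, genome-specific tokens could come to dominate the resulting topology.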

Submission Number: 75