Abstract: Language models (LMs) have been extensively utilized for learning DNA sequence patterns and generating synthetic sequences. In this paper, we present a novel approach for generating synthetic DNA data using pangenomes in combination with LMs. We introduce three innovative pangenome-based tokenization schemes, including two that can decouple from private data while enhancing long DNA sequence generation. Our experimental results demonstrate the superiority of pangenome-based tokenization over classical methods in generating high-utility synthetic DNA sequences, highlighting a promising direction for the public sharing of genomic datasets.
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: Generation; Language Modeling; NLP Applications; Syntax: Tagging, Chunking and Parsing
Contribution Types: NLP engineering experiment
Languages Studied: DNA sequence
Keywords: Pangenome Graph; DNA Generation; DNA Tokenization; NLP Applications
Submission Number: 504