Abstract: Language models (LMs) have been widely used to learn DNA sequence patterns and generate synthetic sequences. In this paper, we present a novel approach to generating synthetic DNA data that combines pangenomes with LMs. We introduce three pangenome-based tokenization schemes, including two that can be decoupled from private data, while enhancing long DNA sequence generation. Our experimental results show that pangenome-based tokenization outperforms classical methods in generating high-utility synthetic DNA sequences, highlighting a promising direction for the public sharing of genomic datasets.
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: Generation; Language Modeling; NLP Applications; Syntax: Tagging, Chunking and Parsing
Contribution Types: NLP engineering experiment
Languages Studied: DNA sequence
Previous URL: https://openreview.net/forum?id=OfF9mUzS3L
Explanation Of Revisions PDF: pdf
Reassignment Request Area Chair: No, I want the same area chair from our previous submission (subject to their availability).
Reassignment Request Reviewers: No, I want the same set of reviewers from our previous submission (subject to their availability)
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: N/A
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: 5.1
B2 Discuss The License For Artifacts: Yes
B2 Elaboration: 5.1
B3 Artifact Use Consistent With Intended Use: N/A
B4 Data Contains Personally Identifying Info Or Offensive Content: Yes
B4 Elaboration: Data used (mentioned in 5.1) is collected and anonymised by the Human Pangenome Reference Consortium
B5 Documentation Of Artifacts: Yes
B5 Elaboration: 5.1 reference
B6 Statistics For Data: Yes
B6 Elaboration: 5.1 and 2.2
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: 5.1, Appendix C and Table 1 in 5.2
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: Appendix C
C3 Descriptive Statistics: N/A
C3 Elaboration: Not applicable to alignment figures, and we made it clear the scores are collected through 20 generations.
C4 Parameters For Packages: Yes
C4 Elaboration: 5.1
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: N/A
E1 Elaboration: Our use of AI assistants was limited to basic language assistance tools, such as Grammarly and Writefull, for grammar checking.
Author Submission Checklist: yes
Submission Number: 298