DDS-E-Sim: A Transformer-based Probabilistic Generative Framework for Simulating Error-Prone DNA Sequences for DNA Data Storage
Keywords: Probabilistic generative model, DNA data storage, Stochastic sequence generation, Error distribution
TL;DR: We present DDS-E-Sim, a universal transformer-based probabilistic generative framework that stochastically simulates erroneous DNA reads, faithfully mimicking the DNA data storage pipeline across various technologies and processes.
Abstract: DNA has emerged as a promising medium for long-lasting data storage due to its high information density and long-term stability. However, DNA storage is a complex process in which each stage introduces noise and errors. Because running DNA data storage experiments in vitro is still expensive and time-consuming, a simulation model that can mimic the error patterns in real data and simulate the experiments is needed. Existing tools often rely on fixed error rates or are specific to certain technologies. We propose DDS-E-Sim, a transformer-based probabilistic generative framework that simulates errors in a DNA data storage channel, regardless of the process or technology. DDS-E-Sim captures the error distribution of DNA storage pipelines and learns to stochastically generate erroneous DNA reads. Given oligos (the DNA sequences to write), it outputs erroneous reads resembling those of real pipelines, capturing both random and biased errors, such as k-mer and transition errors. Evaluations on two distinct technology-specific datasets show high fidelity and universality: DDS-E-Sim exhibits a total error rate deviation of only 0.1% and 0.7% on the datasets processed with Illumina MiSeq and Oxford Nanopore, respectively. Additionally, our simulator generates 100,743 unique oligos from 35,329 sequences at coverage 5 (each sequence read five times) in the test datasets, demonstrating its ability to simulate biased errors and stochastic properties simultaneously.
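To make the contrast with fixed-error-rate tools concrete, the following is a minimal sketch of such a baseline channel, together with one plausible way to compute a total error rate. It is not DDS-E-Sim or the authors' evaluation code: the function names, the default probabilities, and the definition of total error rate as edit distance per written base are all assumptions made for illustration.

```python
import random

BASES = "ACGT"


def fixed_rate_channel(oligo, p_sub=0.01, p_ins=0.005, p_del=0.005, rng=None):
    """Hypothetical baseline: apply position-independent substitution,
    insertion, and deletion errors with fixed probabilities."""
    rng = rng or random.Random()
    read = []
    for base in oligo:
        # Possibly insert a random base before the current one.
        if rng.random() < p_ins:
            read.append(rng.choice(BASES))
        r = rng.random()
        if r < p_del:
            continue  # deletion: drop the base
        if r < p_del + p_sub:
            # substitution: replace with one of the other three bases
            read.append(rng.choice([b for b in BASES if b != base]))
        else:
            read.append(base)
    return "".join(read)


def edit_distance(a, b):
    """Levenshtein distance using a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # match or substitution
        prev = cur
    return prev[-1]


def total_error_rate(pairs):
    """Assumed definition: total edit operations divided by total oligo bases."""
    errors = sum(edit_distance(oligo, read) for oligo, read in pairs)
    bases = sum(len(oligo) for oligo, _ in pairs)
    return errors / bases


if __name__ == "__main__":
    rng = random.Random(0)
    oligos = ["ACGTACGTACGTACGTACGT", "TTGACCAGGTCAATGCCGTA"]
    coverage = 5  # each oligo is read five times, as in the abstract
    pairs = [(o, fixed_rate_channel(o, rng=rng))
             for o in oligos for _ in range(coverage)]
    print(f"simulated total error rate: {total_error_rate(pairs):.4f}")
    print(f"unique reads: {len({read for _, read in pairs})}")
```

Because every position here uses the same probabilities, a channel like this cannot reproduce biased errors such as k-mer or transition preferences, which is the limitation a learned generative model is meant to address. Comparing `total_error_rate` on real and simulated reads gives a deviation figure in the spirit of the one reported above, under the assumed definition.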
Submission Number: 47