Abstract: DNA-based storage systems offer substantial capacity and longevity at reduced cost, making them a candidate for handling anticipated data growth. However, encoding data into DNA sequences is subject to two key constraints: 1) at most $$h$$ consecutive identical bases (homopolymer constraint $$h$$), and 2) a GC ratio within the interval $$[0.5 - c_{GC}, 0.5 + c_{GC}]$$ (GC content constraint $$c_{GC}$$). Sequencing and synthesis errors tend to increase when these constraints are violated. In this research, we address a pure source coding problem in the context of DNA storage, considering both the homopolymer and GC content constraints. We introduce a novel variable-to-variable-length coding technique that satisfies both constraints and does not rely on concatenating short predefined sequences, while keeping complexity linear in the block length and achieving near-optimal rates. We demonstrate the effectiveness of the proposed method through experiments on both randomly generated data and existing files. For example, when $$h = 4$$ and $$c_{GC} = 0.05$$, the rate reached 1.988 bits per nucleotide, close to the theoretical limit of 1.990. The associated code can be accessed at GitHub.
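To make the two constraints concrete, the following is a minimal sketch of a checker that tests whether a given DNA string satisfies a homopolymer limit `h` and a GC content tolerance `c_gc`. It is not the encoding method described in the abstract; the function names and the small floating-point tolerance are assumptions introduced here for illustration only.

```python
def max_homopolymer_run(seq: str) -> int:
    """Return the length of the longest run of identical consecutive bases."""
    longest = 0
    current = 0
    previous = None
    for base in seq:
        # Extend the current run if the base repeats, otherwise start a new run.
        current = current + 1 if base == previous else 1
        longest = max(longest, current)
        previous = base
    return longest


def gc_ratio(seq: str) -> float:
    """Return the fraction of bases in seq that are G or C."""
    if not seq:
        raise ValueError("GC ratio is undefined for an empty sequence")
    return sum(base in "GC" for base in seq) / len(seq)


def satisfies_constraints(seq: str, h: int, c_gc: float) -> bool:
    """Check the homopolymer constraint h and the GC content constraint c_gc.

    The 1e-12 slack is an assumption added to absorb floating-point rounding
    at the interval boundaries.
    """
    run_ok = max_homopolymer_run(seq) <= h
    gc_ok = abs(gc_ratio(seq) - 0.5) <= c_gc + 1e-12
    return run_ok and gc_ok
```

With the abstract's example parameters, `satisfies_constraints("ACGTACGT", 4, 0.05)` holds, whereas a sequence such as `"AAAAACGT"` fails the homopolymer check (a run of five) and `"AAAATTTT"` fails the GC content check (no G or C at all).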
External IDs: dblp:journals/bmcbi/GaoN24