Abstract: Localized errors, which occur within windows of bounded length, are common in a range of applications. Such errors can be modeled as k-substring edits, each of which replaces one substring with another string, where both have length at most k. This model generalizes errors studied in the literature, such as localized deletions and burst substitutions. In this paper, we show through statistical analysis of real data that substring edits describe the differences between related documents better than independent edits do, and thus commonly arise in problems related to data synchronization. We also show that, for the dataset under study and assuming that codes achieving the Gilbert-Varshamov bound exist, substring-edit-correcting codes can synchronize two documents with much lower overhead than general indel/substitution-correcting codes. Furthermore, for a constant k, we construct binary codes of length n that correct a k-substring edit with redundancy of roughly 2 log n, compared to 8 log n, the lowest redundancy achieved by an existing code for this problem. Both encoding and decoding run in time polynomial in n.
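
To make the error model concrete, the following is a minimal Python sketch of a k-substring edit as defined above. The function name `apply_substring_edit`, its parameters, and the example string are illustrative assumptions, not notation from the paper. The last lines compare the two leading redundancy terms for one block length, assuming base-2 logarithms and ignoring lower-order terms.

```python
import math


def apply_substring_edit(x: str, start: int, old_len: int, new: str, k: int) -> str:
    """Replace x[start:start + old_len] with `new` (hypothetical helper).

    Both the removed substring and the inserted string must have
    length at most k, matching the k-substring edit model.
    """
    if not (0 <= old_len <= k and len(new) <= k):
        raise ValueError("both substrings must have length at most k")
    if start < 0 or start + old_len > len(x):
        raise ValueError("edited window must lie inside x")
    return x[:start] + new + x[start + old_len:]


x = "0110100111"
k = 3

# Special case: a localized (burst) deletion replaces a substring with "".
burst_deletion = apply_substring_edit(x, 2, 3, "", k)

# Special case: a burst substitution replaces a substring with one of equal length.
burst_substitution = apply_substring_edit(x, 2, 3, "001", k)

# General case: the two lengths may differ, as long as both are at most k.
general_edit = apply_substring_edit(x, 2, 3, "1", k)

# Leading redundancy terms from the abstract for an assumed length n = 1024,
# taking log to be base 2 (an assumption) and dropping lower-order terms.
n = 1024
redundancy_new = 2 * math.log2(n)
redundancy_prior = 8 * math.log2(n)
```

Here the burst deletion and burst substitution are both instances of the same edit operation, which is the sense in which the substring-edit model generalizes them.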