Data Deduplication with Random Substitutions

Hao Lou, Farzad Farnoud

Published: 2020, Last Modified: 14 May 2023, ISIT 2020, Readers: Everyone

Abstract: Data deduplication saves storage space by identifying and removing repeats in a data stream. In this paper, we provide an information-theoretic analysis of the performance of deduplication algorithms on data streams in which repeats are not exact. We introduce a source model that incorporates probabilistic substitutions. Two modified versions of fixed-length deduplication are studied and proven to perform within a constant factor of optimal when the repeat length is known. We also study the variable-length scheme and show that, as the entropy becomes smaller, the size of the compressed string vanishes relative to the length of the uncompressed string.
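
To make the setting concrete, the following is a minimal Python sketch of plain fixed-length deduplication applied to a toy source with noisy repeats. It is not the paper's source model, nor either of the two modified schemes analyzed there: the function names (`noisy_repeat_source`, `fixed_length_dedup`, `decode`), the binary alphabet, the independent bit-flip noise, and all parameters are assumptions chosen for illustration only.

```python
import random


def noisy_repeat_source(num_blocks, block_len, num_templates, sub_prob, seed=0):
    """Toy source (an assumption, not the paper's model): each output block is a
    copy of a randomly chosen template with each bit flipped with prob. sub_prob."""
    rng = random.Random(seed)
    templates = [
        [rng.randint(0, 1) for _ in range(block_len)] for _ in range(num_templates)
    ]
    blocks = []
    for _ in range(num_blocks):
        template = rng.choice(templates)
        noisy = [bit ^ 1 if rng.random() < sub_prob else bit for bit in template]
        blocks.append("".join(str(bit) for bit in noisy))
    return "".join(blocks)


def fixed_length_dedup(data, chunk_len):
    """Split data into chunks of chunk_len symbols; store each distinct chunk once
    and replace later exact copies with a pointer to the first occurrence."""
    dictionary = {}  # chunk -> index in order of first appearance
    encoded = []
    for start in range(0, len(data), chunk_len):
        chunk = data[start:start + chunk_len]
        if chunk in dictionary:
            encoded.append(("ref", dictionary[chunk]))
        else:
            dictionary[chunk] = len(dictionary)
            encoded.append(("new", chunk))
    return encoded


def decode(encoded):
    """Invert fixed_length_dedup by replaying new chunks and pointers in order."""
    seen = []
    out = []
    for kind, value in encoded:
        if kind == "new":
            seen.append(value)
            out.append(value)
        else:
            out.append(seen[value])
    return "".join(out)


if __name__ == "__main__":
    for sub_prob in (0.0, 0.05):
        data = noisy_repeat_source(200, 16, 4, sub_prob)
        encoded = fixed_length_dedup(data, 16)
        stored = sum(1 for kind, _ in encoded if kind == "new")
        print(f"sub_prob={sub_prob}: {stored} of {len(encoded)} chunks stored verbatim")
```

With `sub_prob=0.0` every block is an exact copy of a template, so at most `num_templates` chunks are stored verbatim. With a positive substitution probability, exact matching finds fewer repeats, which is the difficulty that motivates analyzing deduplication under inexact repeats.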
