SuperDelta: Multiple Referenced Base Chunks Scheme for Fine-grained Deduplication Backup Storage System
Abstract: Deduplication-based techniques are popular in backup storage systems for reducing data volume. To maximize data reduction, existing fine-grained deduplication approaches not only eliminate duplicate chunks but also delta-compress non-duplicate chunks against their similar (base) chunks. However, each chunk may have multiple similar chunks, and delta compression usually selects only one of them as the base chunk, i.e., a one-to-one scheme. This scheme benefits restore performance because decompressing a delta chunk requires reading only one base chunk instead of several, but it also wastes the potential compressibility among the other similar chunks. In this paper, we propose SuperDelta to further exploit compressibility across multiple similar chunks while preserving the restore performance advantage of the one-to-one scheme as much as possible. It is based on three techniques. (1) To further eliminate redundancy among similar chunks, SuperDelta applies a "Multiple Referenced Base Chunks" (MRBC) scheme instead of the one-to-one scheme. It combines several similar pairs of chunks in delta encoding to recover compressibility that may otherwise be lost to "boundary shift" problems. (2) To avoid the negative side effects of MRBC on restore performance, SuperDelta introduces a rebase scheme that rebuilds simple reference paths among duplicate and similar chunks. This significantly simplifies the restore workflow, but it also costs slightly more storage space because it interferes with the redundancy detection workflow.
(3) To compensate for the additional storage cost, SuperDelta applies a space-recycle scheme that removes derived data once it becomes old, while preserving the optimized restore performance of the latest backups. Experiments on four real-world backup datasets show that SuperDelta improves the overall compression ratio by a factor of 1.05 to 2.40 over traditional one-to-one fine-grained deduplication without significantly affecting backup and restore throughput.
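The trade-off between the one-to-one scheme and referencing multiple base chunks can be illustrated with a toy delta encoder. The sketch below is not SuperDelta's actual encoding, which the abstract does not specify. It is a minimal greedy copy/insert encoder written under stated assumptions: the seed length `K`, the function names, and the sample chunks are all hypothetical. It shows that allowing copies from a second base chunk can shrink the literal (uncompressed) part of a delta, at the price of reading more base chunks on restore.

```python
from typing import Dict, List, Set, Tuple, Union

K = 4  # assumed seed length used to find candidate matches

Copy = Tuple[int, int, int]   # (base_id, offset, length)
Op = Union[Copy, bytes]       # a delta is a list of copies and literals


def delta_encode(target: bytes, bases: List[bytes]) -> List[Op]:
    """Greedy copy/insert delta of `target` against one or more bases."""
    # Index every K-byte window of every base; the first occurrence wins.
    index: Dict[bytes, Tuple[int, int]] = {}
    for b_id, base in enumerate(bases):
        for off in range(len(base) - K + 1):
            index.setdefault(base[off:off + K], (b_id, off))

    ops: List[Op] = []
    literal = bytearray()
    pos = 0
    while pos < len(target):
        hit = index.get(target[pos:pos + K])
        if hit is None:
            literal.append(target[pos])  # no match: emit a literal byte
            pos += 1
            continue
        b_id, off = hit
        base = bases[b_id]
        length = K
        # Extend the match forward as far as both buffers agree.
        while (pos + length < len(target) and off + length < len(base)
               and target[pos + length] == base[off + length]):
            length += 1
        if literal:
            ops.append(bytes(literal))
            literal.clear()
        ops.append((b_id, off, length))
        pos += length
    if literal:
        ops.append(bytes(literal))
    return ops


def delta_decode(ops: List[Op], bases: List[bytes]) -> bytes:
    """Rebuild the target; every referenced base must be available."""
    out = bytearray()
    for op in ops:
        if isinstance(op, bytes):
            out += op
        else:
            b_id, off, length = op
            out += bases[b_id][off:off + length]
    return bytes(out)


def literal_bytes(ops: List[Op]) -> int:
    """Bytes stored verbatim in the delta (a rough proxy for delta size)."""
    return sum(len(op) for op in ops if isinstance(op, bytes))


def bases_needed(ops: List[Op]) -> Set[int]:
    """Base chunks that must be read to restore this delta."""
    return {op[0] for op in ops if not isinstance(op, bytes)}


# Hypothetical sample chunks: the target shares content with both bases.
base_a = b"alpha-section-0123456789"
base_b = b"beta-section-ABCDEFGHIJ"
target = b"alpha-section-0123456789ABCDEFGHIJ!!"

one_to_one = delta_encode(target, [base_a])
multi_base = delta_encode(target, [base_a, base_b])
```

In this example the one-to-one delta stores 12 literal bytes and reads one base chunk on restore, while the two-base delta stores 2 literal bytes but reads two base chunks. That extra read is the kind of restore cost the rebase scheme described in the abstract is meant to contain.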
External IDs:dblp:conf/dcc/TanZWGX24