MSDZip: Universal Lossless Compression for Multi-source Data via Stepwise-parallel and Learning-based Prediction

Published: 29 Jan 2025, Last Modified: 29 Jan 2025 · WWW 2025 Poster · CC BY 4.0
Track: Systems and infrastructure for Web, mobile, and WoT
Keywords: multi-source data, lossless data compression, neural networks, deep learning, parallel computing
Abstract: With the rapid development of the Internet, the huge volume of multi-source data (MSD) poses challenges for data sharing and storage. Lossless data compression is the principal way to address these problems. Neural-network technologies now bring significant advantages to data modeling, and learning-based lossless compressors (LLCs) for multi-source data have emerged continuously. Compared with traditional compressors, LLCs are better at capturing complex redundancy patterns in MSD and thus have great potential to improve the compression ratio. However, existing LLCs still suffer from unsatisfactory compression ratios and low throughput. To solve these problems, we propose MSDZip, a novel universal MSD lossless compressor built on stepwise-parallel and learning-based prediction technologies. It introduces two major designs: 1) a Local-Global-Deep Mixing block in the learning-based prediction module that establishes dependencies among MSD symbols, where the designed Deep Mixing block stabilizes the weights of the perceptual layers against the cold-start problem and thereby significantly enhances the compression ratio; and 2) a stepwise-parallel, multi-GPU-accelerated compression strategy that addresses the compression-speed and graphics-memory constraints of a single GPU on large-scale data. The stepwise-parallel module feeds the source MSD to the learning-based prediction model through a data-chunking strategy, in which the model of the previous chunk guides the compression of the next chunk in parallel. We compare MSDZip with 5 classical learning-based and 6 traditional compressors on 12 well-studied real-world datasets. The experimental results demonstrate that MSDZip improves compression ratio by 3.418%-69.874% and throughput by 31.171%-495.649% compared to advanced LLCs.
The source code of MSDZip and the linkages of the experimental datasets are available at https://anonymous.4open.science/r/MSDZip-0E4E/.
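The chunking scheme described in the abstract — where the model adapted on the previous chunk guides the coding of the next chunk — can be illustrated with a toy sketch. This is not the paper's implementation (which uses a neural predictor on multiple GPUs); it substitutes a simple adaptive order-0 frequency model, and all names below are hypothetical, merely showing how per-chunk coding cost can be estimated under a frozen previous-chunk model before the model absorbs the chunk:

```python
import math
from collections import Counter

def chunked_code_length(data: bytes, chunk_size: int) -> float:
    """Estimate coded size in bits under a chunked scheme: each chunk
    is scored with the model adapted on all previous chunks, then the
    model absorbs the chunk before moving on (a toy stand-in for the
    'previous chunk guides the next chunk' strategy)."""
    counts = Counter(range(256))  # Laplace-smoothed order-0 model
    total = 256
    bits = 0.0
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        # Score the chunk with the frozen previous-chunk model:
        # an ideal entropy coder spends -log2(p) bits per symbol.
        for b in chunk:
            bits += -math.log2(counts[b] / total)
        # Adapt the model on this chunk for the next step.
        for b in chunk:
            counts[b] += 1
            total += 1
    return bits
```

In this toy setting, highly redundant input (e.g. `b"a" * 1000`) costs far fewer bits than the 8 bits per byte of the raw encoding, because later chunks are scored by a model already adapted to earlier ones. In MSDZip the same hand-off is what enables parallelism: while one GPU codes chunk k with the frozen model, the model for chunk k+1 can be prepared concurrently.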
Submission Number: 1793