Your Dataset is a Multiset and You Should Compress it Like One

Published: 08 Dec 2021 (Last Modified: 05 May 2023) · DGMs and Applications @ NeurIPS 2021 Oral
Keywords: compression, information theory, bits-back, generative models, multisets, permutations, entropy coding, datasets
TL;DR: A computationally efficient method that allows any codec to reduce its bitrate, at the expense of shuffling the dataset in the process.
Abstract: Neural Compressors (NCs) are codecs that leverage neural networks and entropy coding to achieve competitive compression performance for images, audio, and other data types. These compressors exploit parallel hardware and are particularly well suited to compressing i.i.d. batches of data. The average number of bits needed to represent each example is at least the well-known cross-entropy. However, the cross-entropy bound assumes the order of the compressed examples in a batch is preserved, which in many applications is not necessary. The number of bits used to implicitly store the order information is the logarithm of the number of unique permutations of the dataset. In this work, we present a method that reduces the bitrate of any codec by exactly the number of bits needed to store the order, at the expense of shuffling the dataset in the process. Conceptually, our method applies bits-back coding to a latent variable model with observed symbol counts (i.e., a multiset) and a latent permutation defining the ordering, and does not require retraining any models. We present experiments with both lossy off-the-shelf codecs (WebP) and lossless NCs. On Binarized MNIST, lossless NCs achieved savings of up to $7.6\%$ while adding only $10\%$ extra compute time.
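To illustrate the abstract's claim that the implicit order information costs the logarithm of the number of unique permutations, below is a minimal sketch (not from the paper) that computes this theoretical saving from the multiset's symbol counts, i.e. $\log_2\!\big(n! / \prod_i c_i!\big)$ bits for a dataset of $n$ examples with counts $c_i$.

```python
from math import lgamma, log

def order_information_bits(counts):
    """Bits implicitly spent on ordering: log2 of the number of distinct
    permutations of a multiset with the given symbol counts,
    i.e. log2(n! / prod(c_i!))."""
    n = sum(counts)
    log_perms_nats = lgamma(n + 1) - sum(lgamma(c + 1) for c in counts)
    return log_perms_nats / log(2)  # convert nats to bits

# Hypothetical example: a dataset of n = 10_000 examples that are all
# distinct (count 1 each), so the saving is log2(n!) in total.
n = 10_000
total_bits = order_information_bits([1] * n)
print(f"total saving: {total_bits:.1f} bits "
      f"({total_bits / n:.2f} bits/example)")
```

When all examples are distinct, the per-example saving grows roughly like $\log_2 n$, which is why discarding the ordering of a large batch can yield a noticeable reduction in bitrate.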