Keywords: Dataset distillation, coreset selection
Abstract: The power of state-of-the-art deep learning models heavily depends on large amounts (millions or even billions) of training data, which hinders researchers with limited resources from conducting relevant research and causes heavy CO2 emissions. Dataset distillation methods have thus been developed to compress large datasets into smaller ones and reduce model training cost, by synthesizing samples that match the original ones w.r.t. certain metrics such as the training loss. However, existing methods generally suffer from poor scalability (they are not applicable to compressing large-scale datasets such as ImageNet) and limited generalizability when training other model architectures. We empirically observe that the reason is that the condensed datasets lose the sample diversity of the original ones. Driven by this, we study dataset compression from a new perspective: what is the minimum number of pixels necessary to represent the whole dataset without losing its diversity? We develop a new dataset quantization (DQ) framework around this question. DQ conducts compression at two levels, the sample level and the pixel level. It introduces a sample-level quantizer to find a compact set of samples that better represents the distribution of the full dataset, and a pixel-level quantizer to find the minimum number of pixels needed to describe every single image. Combining these two quantizers, DQ achieves a new state-of-the-art lossless compression ratio for datasets and produces compressed datasets that are practical for training models of a large variety of architectures. Specifically, for image classification, it successfully removes 40% of the data with only a 0.4% top-5 accuracy drop on ImageNet and almost zero accuracy drop on CIFAR-10. We further verify that model weights pre-trained on the 40%-compressed dataset lose only 0.2% mAP on the COCO dataset for object detection and 0.3% mIoU on ADE20K for segmentation. Code will be made public.
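As a rough illustration of the two-level compression described in the abstract, the sketch below pairs a toy sample-level quantizer (even sampling from bins over feature distances) with a toy pixel-level quantizer (dropping image patches). The binning heuristic, patch size, and all function names are illustrative assumptions of this sketch, not the paper's actual quantizers.

```python
# Minimal sketch of the two-level idea, NOT the authors' implementation.
# The bin-based selection and random patch dropping below are assumptions.
import numpy as np

def sample_level_quantize(features: np.ndarray, keep_ratio: float, n_bins: int = 10):
    """Pick a diverse subset of sample indices.

    Assumption: diversity is approximated by binning samples along their
    distance to the feature mean and sampling evenly from every bin.
    """
    rng = np.random.default_rng(0)
    dist = np.linalg.norm(features - features.mean(axis=0), axis=1)
    bins = np.array_split(np.argsort(dist), n_bins)
    per_bin = max(1, int(len(features) * keep_ratio / n_bins))
    kept = np.concatenate(
        [rng.choice(b, size=min(per_bin, len(b)), replace=False) for b in bins]
    )
    return np.sort(kept)

def pixel_level_quantize(image: np.ndarray, patch: int = 16, keep_ratio: float = 0.6):
    """Zero out a fraction of non-overlapping patches as a placeholder for the
    pixel-level quantizer (a real system might instead reconstruct the dropped
    patches, e.g. with a masked autoencoder)."""
    rng = np.random.default_rng(0)
    h, w = image.shape[:2]
    out = image.copy()
    patches = [(i, j) for i in range(0, h - patch + 1, patch)
                      for j in range(0, w - patch + 1, patch)]
    n_drop = int(len(patches) * (1 - keep_ratio))
    for idx in rng.choice(len(patches), size=n_drop, replace=False):
        i, j = patches[idx]
        out[i:i + patch, j:j + patch] = 0  # dropped patch
    return out

# Toy usage: keep ~60% of 1,000 samples, then drop ~40% of patches per image.
feats = np.random.randn(1000, 512)
kept_ids = sample_level_quantize(feats, keep_ratio=0.6)
img = np.random.rand(224, 224, 3)
compressed = pixel_level_quantize(img, keep_ratio=0.6)
```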
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Applications (eg, speech processing, computer vision, NLP)
Supplementary Material: zip