Keywords: Dataset distillation, coreset selection
Abstract: The power of state-of-the-art deep learning models heavily depends on large amounts (millions or even billions) of training data, which hinders researchers with limited resources from conducting relevant research and causes heavy CO2 emissions. Dataset distillation methods have thus been developed to compress large datasets into smaller ones and reduce model training cost, by synthesizing samples that match the original ones w.r.t. certain metrics such as the training loss. However, existing methods generally suffer from poor scalability (they are not applicable to compressing large-scale datasets such as ImageNet) and limited generalizability when training other model architectures. We empirically observe that the reason is that the condensed datasets lose the sample diversity of the original ones. Driven by this, we study dataset compression from a new perspective: what is the minimum number of pixels necessary to represent the whole dataset without losing its diversity? We develop a new dataset quantization (DQ) framework around this question. DQ conducts compression at two levels, the sample level and the pixel level. It introduces a sample-level quantizer to find a compact set of samples that better represents the distribution of the full dataset, and a pixel-level quantizer to find the minimum number of pixels needed to describe every single image. Combining these two quantizers, DQ achieves a new state-of-the-art lossless compression ratio for datasets and produces compressed datasets that are practical for training models of a large variety of architectures. Specifically, for image classification, it successfully removes 40% of the data with only a 0.4% top-5 accuracy drop on ImageNet and almost zero accuracy drop on CIFAR-10. We further verify that model weights pre-trained on the 40%-compressed dataset lose only 0.2% mAP on the COCO dataset for object detection and 0.3% mIoU on ADE20K for segmentation. Code will be made public.
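As a rough illustration of the two-level compression described in the abstract, the sketch below pairs a toy sample-level quantizer (even sampling from bins over feature distances) with a toy pixel-level quantizer (dropping image patches). The binning heuristic, patch size, and all function names are illustrative assumptions of this sketch, not the paper's actual quantizers.

```python
# Minimal sketch of the two-level idea, NOT the authors' implementation.
# The bin-based selection and random patch dropping below are assumptions.
import numpy as np

def sample_level_quantize(features: np.ndarray, keep_ratio: float, n_bins: int = 10):
    """Pick a diverse subset of sample indices.

    Assumption: diversity is approximated by binning samples along their
    distance to the feature mean and sampling evenly from every bin.
    """
    rng = np.random.default_rng(0)
    dist = np.linalg.norm(features - features.mean(axis=0), axis=1)
    bins = np.array_split(np.argsort(dist), n_bins)
    per_bin = max(1, int(len(features) * keep_ratio / n_bins))
    kept = np.concatenate(
        [rng.choice(b, size=min(per_bin, len(b)), replace=False) for b in bins]
    )
    return np.sort(kept)

def pixel_level_quantize(image: np.ndarray, patch: int = 16, keep_ratio: float = 0.6):
    """Zero out a fraction of non-overlapping patches as a placeholder for the
    pixel-level quantizer (a real system might instead reconstruct the dropped
    patches, e.g. with a masked autoencoder)."""
    rng = np.random.default_rng(0)
    h, w = image.shape[:2]
    out = image.copy()
    patches = [(i, j) for i in range(0, h - patch + 1, patch)
                      for j in range(0, w - patch + 1, patch)]
    n_drop = int(len(patches) * (1 - keep_ratio))
    for idx in rng.choice(len(patches), size=n_drop, replace=False):
        i, j = patches[idx]
        out[i:i + patch, j:j + patch] = 0  # dropped patch
    return out

# Toy usage: keep ~60% of 1,000 samples, then drop ~40% of patches per image.
feats = np.random.randn(1000, 512)
kept_ids = sample_level_quantize(feats, keep_ratio=0.6)
img = np.random.rand(224, 224, 3)
compressed = pixel_level_quantize(img, keep_ratio=0.6)
```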
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Applications (eg, speech processing, computer vision, NLP)
Supplementary Material: zip