Keywords: Dataset Condensation
Abstract: Dataset condensation (DC) techniques are widely used in buffer-constrained scenarios to reduce the memory demand of training samples while maintaining DNN training performance. However, due to the storage constraints of deployment devices and the high energy cost of the condensation procedure, the synthetic datasets generated by DC often exhibit inferior training efficiency and scalability, which greatly limits the practical application of DC on various edge devices.
This dilemma arises for two reasons: i) existing state-of-the-art (SoTA) DC approaches update synthetic datasets by heuristically matching intermediate training outputs (e.g., gradients, features, and distributions) between real and synthetic datasets, without improving the representational capability of the synthetic data in terms of the useful information it contains; ii) existing DC methods do not sufficiently account for the heterogeneous storage constraints of different edge devices, which results in large training overheads (i.e., energy consumption or storage).
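For reference, the matching paradigm criticized above can be illustrated with a minimal gradient-matching step in the spirit of prior SoTA DC work; the network, optimizer, and distance function below are illustrative assumptions, not the method proposed in this submission.

```python
import torch
import torch.nn.functional as F

def gradient_matching_step(model, real_x, real_y, syn_x, syn_y, syn_opt):
    """One update of the synthetic images by matching gradients (illustrative sketch)."""
    # Gradients of the task loss on a real batch, treated as fixed targets.
    real_loss = F.cross_entropy(model(real_x), real_y)
    real_grads = [g.detach() for g in torch.autograd.grad(real_loss, model.parameters())]

    # Gradients of the same loss on the learnable synthetic batch (graph kept
    # so the matching loss can backpropagate into syn_x).
    syn_loss = F.cross_entropy(model(syn_x), syn_y)
    syn_grads = torch.autograd.grad(syn_loss, model.parameters(), create_graph=True)

    # Matching objective: distance between the two sets of gradients.
    match_loss = sum(F.mse_loss(sg, rg) for sg, rg in zip(syn_grads, real_grads))

    syn_opt.zero_grad()
    match_loss.backward()  # updates the synthetic images only
    syn_opt.step()
    return match_loss.item()
```

Here `syn_x` is assumed to be a leaf tensor with `requires_grad=True` registered in `syn_opt`; the point of the sketch is that such objectives match intermediate outputs rather than explicitly increasing the useful information carried by the synthetic data.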
To tackle the above issues, we propose a novel method named Mixture-of-Information Bottleneck Dataset Condensation (MIBDC), which employs information bottlenecks derived from synthetic datasets with various Images Per Class (IPC) numbers to improve the overall generalization and scalability of DC.
Specifically, this paper reports the following two findings: i) the quality of a synthetic dataset improves as its size increases; ii) the smaller a synthetic dataset is, the earlier it reaches its convergence peak.
Based on these two findings, this paper proposes that i) larger synthetic datasets can guide smaller ones toward better convergence, and ii) the information contained in synthetic datasets with different IPC numbers can play a collaborative role in guiding dataset condensation toward better generalization.
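As a purely illustrative reading of this proposal (the abstract does not specify the actual MIBDC objective), one could imagine a guidance term in which a model fitted on a smaller-IPC synthetic set is softly pulled toward the predictions of a model fitted on a larger-IPC set; every name and hyperparameter below is a hypothetical placeholder, not the authors' formulation.

```python
import torch
import torch.nn.functional as F

def cross_ipc_guidance_loss(small_ipc_model, large_ipc_model, probe_x, temperature=2.0):
    """Hypothetical cross-IPC guidance: KL divergence between softened predictions
    of a model trained on a large-IPC synthetic set (teacher) and one trained on
    a small-IPC synthetic set (student)."""
    with torch.no_grad():
        teacher_logits = large_ipc_model(probe_x) / temperature
    student_logits = small_ipc_model(probe_x) / temperature
    return F.kl_div(
        F.log_softmax(student_logits, dim=1),
        F.softmax(teacher_logits, dim=1),
        reduction="batchmean",
    ) * temperature ** 2
```

Such a term would let the information in larger-IPC synthetic sets guide the convergence of smaller ones, in line with the collaborative role described above.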
Comprehensive experimental results on three well-known datasets show that, compared with state-of-the-art dataset condensation methods, MIBDC not only enhances the generalization performance of trained models but also achieves superior scalability.
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 13264