Diverse Condensed Data Generation via Class Preserving Distribution Matching

TMLR Paper4411 Authors

06 Mar 2025 (modified: 27 May 2025) · Under review for TMLR · CC BY 4.0
Abstract: Training many real-world machine learning models on large-scale datasets poses significant computational resource challenges. One approach to mitigate this is data condensation, which aims to learn a small dataset that still sufficiently captures the rich information in the original one. Most existing approaches learn the condensed dataset and task-related model parameters (e.g., a classifier) through bi-level meta-learning. The recently proposed distribution matching (DM) avoids the expensive bi-level optimization but ignores task-related models. This work proposes a novel class-preserving DM framework consisting of two key components. The first captures the original data distribution of each class using the energy distance, which encourages diversity in the generated synthetic data. The second is a classifier-critic constraint, which forces the learned synthetic samples to fit pre-trained task-related models, such as an off-the-shelf classifier. By designing the optimization loss in this way, we can generate more diverse and class-preserving distilled data without bi-level optimization. Extensive experiments show that our method produces more effective condensed data for downstream tasks at lower training cost and can also be successfully applied to de-biased dataset condensation.
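
The abstract describes a loss built from two terms: a per-class energy-distance matching term and a classifier-critic term using a frozen pre-trained classifier. The sketch below is a minimal PyTorch illustration of such a combined objective, assuming flattened image tensors; the function names (`energy_distance`, `condensation_loss`, `critic`) and the weight `lam` are illustrative placeholders, not the authors' actual implementation.

```python
import torch
import torch.nn.functional as F

def energy_distance(x, y):
    """Empirical energy distance 2*E||x-y|| - E||x-x'|| - E||y-y'|| between
    two batches of flattened samples; smaller values mean better-matched
    distributions. Assumes both batches are non-empty."""
    x, y = x.flatten(1), y.flatten(1)
    return (2 * torch.cdist(x, y).mean()
            - torch.cdist(x, x).mean()
            - torch.cdist(y, y).mean())

def condensation_loss(real_x, real_y, syn_x, syn_y, critic, lam=1.0):
    """Per-class energy-distance matching plus a classifier-critic term.
    `critic` is assumed to be a pre-trained classifier whose parameters are
    frozen (e.g. via requires_grad_(False)); gradients flow only into syn_x,
    the learnable synthetic images."""
    loss_dm = sum(energy_distance(real_x[real_y == c], syn_x[syn_y == c])
                  for c in syn_y.unique())
    loss_critic = F.cross_entropy(critic(syn_x), syn_y)
    return loss_dm + lam * loss_critic
```

In this reading, optimizing the synthetic images against both terms avoids the bi-level loop: the distribution-matching term pulls each synthetic class toward the real class distribution, while the critic term keeps the samples consistent with an off-the-shelf classifier.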
Submission Length: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Andreas_Kirsch1
Submission Number: 4411