Diversified Semantic Distribution Matching for Dataset Distillation

Hongcheng Li, Yucan Zhou, Xiaoyan Gu, Bo Li, Weiping Wang

Published: 01 Jan 2024, Last Modified: 06 Mar 2025. ACM Multimedia 2024. CC BY-SA 4.0

Abstract: Dataset distillation, also known as dataset condensation, offers a way to compress a large-scale dataset into a small-scale one (i.e., a distilled dataset) while achieving similar performance during model training. It thereby addresses the training-efficiency and storage-cost challenges posed by large-scale datasets. Existing dataset distillation methods can be categorized into Optimization-Oriented (OO)-based and Distribution-Matching (DM)-based methods. Because OO-based methods require bi-level optimization that alternately optimizes the model and the distilled data, their high computational overhead is a challenge in practical applications. DM-based methods have therefore emerged as an alternative that aligns the prototypes of the distilled data with those of the original data. Although efficient, these methods overlook the diversity of the distilled data, which limits performance on evaluation tasks. In this paper, we propose a novel Diversified Semantic Distribution Matching (DSDM) approach for dataset distillation. To accurately capture semantic features, we first pre-train models for dataset distillation. Subsequently, we estimate the distribution of each category by calculating its prototype and covariance matrix, where the covariance matrix indicates the direction of semantic feature transformations for each category. Then, in addition to the prototypes, the covariance matrices are also matched to obtain more diversity in the distilled data. However, since the distilled data are optimized by multiple pre-trained models, the training process fluctuates severely. Therefore, we match the distilled data of the current pre-trained model with the historical integrated prototypes. Experimental results demonstrate that our DSDM achieves state-of-the-art results on both image and speech datasets. Code is available at https://github.com/Li-Hongcheng/DSDM.
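
The idea of matching covariance matrices in addition to prototypes can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository linked above for that); it assumes features are plain lists of floats for a single category, that the prototype is the per-category feature mean, and that both terms are compared with a squared Euclidean/Frobenius distance. The function names and the `cov_weight` parameter are hypothetical.

```python
# Illustrative sketch only: names, distances, and weighting are assumptions,
# not taken from the DSDM code base.

def mean_vector(features):
    """Prototype of one category: the mean of its feature vectors."""
    n = len(features)
    d = len(features[0])
    return [sum(row[j] for row in features) / n for j in range(d)]


def covariance_matrix(features):
    """Sample covariance matrix of one category's feature vectors."""
    n = len(features)
    mu = mean_vector(features)
    d = len(mu)
    centered = [[row[j] - mu[j] for j in range(d)] for row in features]
    denom = max(n - 1, 1)  # avoid division by zero for a single sample
    return [
        [sum(c[i] * c[j] for c in centered) / denom for j in range(d)]
        for i in range(d)
    ]


def distribution_matching_loss(real_feats, syn_feats, cov_weight=1.0):
    """Squared prototype distance plus weighted squared covariance distance."""
    mu_r, mu_s = mean_vector(real_feats), mean_vector(syn_feats)
    proto_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_s))
    cov_r, cov_s = covariance_matrix(real_feats), covariance_matrix(syn_feats)
    cov_term = sum(
        (a - b) ** 2
        for row_r, row_s in zip(cov_r, cov_s)
        for a, b in zip(row_r, row_s)
    )
    return proto_term + cov_weight * cov_term


# Real features spread around (1, 1); the "collapsed" synthetic set sits
# exactly on the prototype and therefore has no diversity.
real = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
collapsed = [[1.0, 1.0], [1.0, 1.0]]

prototype_only = distribution_matching_loss(real, collapsed, cov_weight=0.0)
with_covariance = distribution_matching_loss(real, collapsed, cov_weight=1.0)
print(prototype_only, with_covariance)
```

In this toy example the prototype-only loss is zero even though the synthetic points have collapsed onto the mean, while the covariance term stays positive and so penalizes the lack of diversity. In a real pipeline the features would come from the pre-trained models and the loss would be minimized with respect to the synthetic data using an autodiff framework.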