Abstract: Dataset distillation, also known as dataset condensation, makes it possible to compress a large-scale dataset into a small-scale one (i.e., a distilled dataset) while achieving similar performance during model training. This effectively tackles the challenges of training efficiency and storage cost posed by large-scale datasets. Existing dataset distillation methods can be categorized into Optimization-Oriented (OO)-based and Distribution-Matching (DM)-based methods. Since OO-based methods require bi-level optimization to alternately optimize the model and the distilled data, their high computational overhead limits practical applications. DM-based methods have therefore emerged as an alternative, aligning the prototypes of the distilled data with those of the original data. Although efficient, these methods overlook the diversity of the distilled data, which limits performance on evaluation tasks. In this paper, we propose a novel Diversified Semantic Distribution Matching (DSDM) approach for dataset distillation. To accurately capture semantic features, we first pre-train models for dataset distillation. Subsequently, we estimate the distribution of each category by calculating its prototype and covariance matrix, where the covariance matrix indicates the directions of semantic feature transformations for that category. Then, in addition to the prototypes, the covariance matrices are also matched, endowing the distilled data with greater diversity. However, since the distilled data are optimized against multiple pre-trained models, the training process fluctuates severely. We therefore match the distilled data of the current pre-trained model against historically integrated prototypes. Experimental results demonstrate that DSDM achieves state-of-the-art results on both image and speech datasets. Code will be released soon.
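The core idea in the abstract, matching second moments (covariance) in addition to first moments (prototypes), can be illustrated with a minimal sketch. This is not the authors' DSDM implementation; the function name, the plain squared-Frobenius penalties, and the uniform weighting of the two terms are illustrative assumptions.

```python
import numpy as np

def distribution_matching_loss(real_feats, syn_feats):
    """Illustrative distribution-matching objective for one category:
    align the prototype (mean) and the covariance of features extracted
    from real data vs. distilled (synthetic) data.

    real_feats, syn_feats: arrays of shape (n_samples, feature_dim),
    e.g., embeddings from a pre-trained feature extractor.
    """
    # Prototype (first-moment) matching, as in standard DM-based methods
    mu_real = real_feats.mean(axis=0)
    mu_syn = syn_feats.mean(axis=0)
    proto_loss = np.sum((mu_real - mu_syn) ** 2)

    # Covariance (second-moment) matching; the covariance encodes the
    # directions of semantic feature variation, so matching it encourages
    # diversity in the distilled data
    cov_real = np.cov(real_feats, rowvar=False)
    cov_syn = np.cov(syn_feats, rowvar=False)
    cov_loss = np.sum((cov_real - cov_syn) ** 2)

    return proto_loss + cov_loss
```

In practice such a loss would be computed per category in the feature space of each pre-trained model and minimized with respect to the distilled images; a weighting hyperparameter between the two terms would typically be tuned.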
Primary Subject Area: [Engagement] Summarization, Analytics, and Storytelling
Secondary Subject Area: [Engagement] Summarization, Analytics, and Storytelling, [Content] Media Interpretation, [Systems] Data Systems Management and Indexing
Relevance To Conference: The rapid progress of multimedia technology and its wide range of commercial applications directly reflect the value extracted from vast amounts of data. In the era of big data, massive datasets pose unprecedented challenges to data storage and training efficiency. Recent advances in dataset distillation have shown that a large-scale dataset can be compressed into a smaller but information-rich one. However, existing methods tend to ignore the semantic features of the data, which limits their effectiveness in practical applications. In view of this, this paper proposes a diversified semantic distribution matching method to extract the deep value in data. We hope that this approach can distill complex multimedia content into a concise yet faithful representation. It is expected not only to improve the efficiency of data processing and storage, but also to provide new perspectives on understanding the intrinsic meaning of multimedia data.
Submission Number: 852