Elucidating the Design Space of Data Condensation
Abstract: Dataset condensation, a concept within \textit{data-centric learning}, efficiently transfers critical attributes from an original dataset to a synthetic version while preserving both diversity and realism. This approach significantly improves model training efficiency and is adaptable across multiple application areas. Previous dataset condensation methods face challenges: some incur high computational costs that limit scalability to larger datasets (\textit{e.g.,} MTT, DREAM, and TESLA), while others are restricted to suboptimal design spaces, which hinders potential improvements, especially on smaller datasets (\textit{e.g.,} SRe$^2$L, G-VBSM, and RDED). To address these limitations, we propose a comprehensive design framework that includes specific, effective strategies such as soft category-aware matching and an adjusted learning rate schedule, each grounded in empirical evidence and theoretical analysis. Our resulting approach, \textbf{E}lucidate \textbf{D}ataset \textbf{C}ondensation (\textbf{EDC}), establishes a new benchmark for both small- and large-scale dataset condensation. In our experiments, EDC achieves state-of-the-art accuracy of 48.6\% on ImageNet-1k with ResNet-18 at 10 images per class (IPC), corresponding to a compression ratio of 0.78\%. This exceeds SRe$^2$L, G-VBSM, and RDED by margins of 27.3\%, 17.2\%, and 6.6\%, respectively.
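For reference, the quoted compression ratio can be read as the condensed set size over the full training set size; a minimal worked form of that arithmetic is shown below, assuming the standard ImageNet-1k training split of 1{,}281{,}167 images across 1{,}000 classes (these dataset figures are our assumption, not stated in the abstract).
\[
\text{compression ratio} \;=\; \frac{\text{IPC} \times \#\text{classes}}{|\mathcal{D}_{\text{full}}|} \;=\; \frac{10 \times 1000}{1{,}281{,}167} \;\approx\; 0.78\%.
\]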