Abstract: Dataset distillation improves neural network training efficiency by compressing large real datasets into compact synthetic datasets. Existing methods typically optimize matching objectives, such as aligning gradients, features, and trajectories between the synthetic and original datasets, to ensure that the distilled data retains the properties essential for model training. However, many of these approaches rely on predefined distillation pools to streamline the process or treat all real data points equally, overlooking the dynamic nature of the synthetic dataset’s training requirements during optimization. To address these limitations, we propose Active Dataset Distillation via Dual-Space Informative Matching (ACDD), an active learning-based algorithm that dynamically selects the most informative real data subset to align with the synthetic dataset’s evolving needs. By adaptively refining the distillation pool, ACDD enhances training efficiency and generalization while ensuring that the synthetic dataset effectively captures the original data’s key characteristics. ACDD operates through two interconnected loops: the dual-space active loop (DAL) and the distillation loop. DAL plays a key role by dynamically selecting samples that balance diversity and uncertainty, adding them to the target distillation pool to meet the evolving informational needs of the current distillation loop. As a result, ACDD enables the synthetic dataset to achieve superior performance compared to state-of-the-art (SOTA) methods across multiple benchmarks, including SVHN, CIFAR-10, CIFAR-100, TinyImageNet, and ImageNet subsets. Moreover, ACDD reduces the required real dataset to just 20%–40% of the original, demonstrating its efficiency and effectiveness in dataset distillation.
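The abstract describes DAL as a step that scores candidate real samples by uncertainty and diversity and adds the highest scorers to the distillation pool. The sketch below is only a minimal illustration of that general idea, not the authors' implementation: the interfaces `embed`, `classify`, the mixing weight `alpha`, and the entropy-plus-minimum-distance scoring are assumptions introduced here for illustration.

```python
import torch
import torch.nn.functional as F

def select_informative_subset(embed, classify, candidates, pool_feats, k, alpha=0.5):
    """Illustrative active-selection step (hypothetical, not the paper's code).

    embed:      callable mapping a batch of images to feature vectors (assumed)
    classify:   callable mapping feature vectors to class logits (assumed)
    candidates: tensor of candidate real images, shape (N, C, H, W)
    pool_feats: features of samples already in the distillation pool, shape (P, D)
    k:          number of candidates to add to the pool
    alpha:      assumed weight trading off uncertainty vs. diversity
    """
    with torch.no_grad():
        feats = embed(candidates)              # candidate features, shape (N, D)
        probs = F.softmax(classify(feats), dim=1)

    # Uncertainty term: Shannon entropy of the predictive distribution.
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)

    # Diversity term: distance to the nearest feature already in the pool.
    if pool_feats.numel() > 0:
        diversity = torch.cdist(feats, pool_feats).min(dim=1).values
    else:
        diversity = torch.ones_like(entropy)   # empty pool: all candidates equally novel

    # Normalize both terms to [0, 1] before mixing, then pick the top-k scorers.
    def norm(x):
        return (x - x.min()) / (x.max() - x.min() + 1e-12)

    score = alpha * norm(entropy) + (1.0 - alpha) * norm(diversity)
    return score.topk(k).indices
```

In this reading, the selected indices would be appended to the distillation pool before the next iteration of the distillation loop, so the matching objective is computed against real samples chosen for the synthetic dataset's current needs rather than a fixed pool; the exact scoring and scheduling used by ACDD are detailed in the paper itself.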