Keywords: Dataset Condensation
TL;DR: Heterogeneous Dataset Condensation
Abstract: Dataset Condensation (DC) is a powerful technique for reducing large-scale training costs, but its effectiveness is largely confined to homogeneous data. When confronted with heterogeneous datasets from multiple sources, existing DC methods falter, often collapsing toward dominant visual styles and discarding crucial domain-specific information. To address this critical limitation, we propose Condensing Heterogeneous Datasets without Domain Labels (CHDDL), a novel framework that embeds rich domain diversity directly into synthetic images. CHDDL achieves this through a domain-aware module that employs learnable spatial masks, guided by a lightweight and entirely unsupervised FFT-based pseudo-labeling scheme. Crucially, our approach operates without requiring explicit domain labels and preserves the original Images Per Class (IPC) budget, making it a practical, plug-and-play enhancement for existing DC methods. Extensive experiments demonstrate that CHDDL consistently outperforms strong baselines across single-domain, multi-domain, and cross-architecture generalization settings, highlighting its potential as a key component for robust dataset condensation in realistic, multi-source environments.
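The abstract describes an unsupervised FFT-based pseudo-labeling scheme for recovering domain structure without labels. The paper's exact procedure is not given here, so the following is only an illustrative sketch of one plausible realization: cluster images by their low-frequency Fourier amplitude statistics (a common proxy for visual style) to produce pseudo domain labels. All function names and parameters are hypothetical.

```python
import numpy as np

def fft_domain_pseudo_labels(images, n_domains=3, n_iters=20, seed=0):
    """Hypothetical sketch: unsupervised domain pseudo-labels from FFT amplitudes.

    images: array of shape (N, H, W), grayscale for simplicity.
    Returns an array of N pseudo-domain labels in [0, n_domains).
    """
    # Low-frequency amplitude spectra are a common proxy for global
    # style/domain cues (as opposed to phase, which carries content).
    feats = []
    for img in images:
        amp = np.abs(np.fft.fftshift(np.fft.fft2(img)))
        h, w = amp.shape
        ch, cw = h // 2, w // 2
        low = amp[ch - 4:ch + 4, cw - 4:cw + 4]  # central 8x8 low-freq block
        feats.append(np.log1p(low).ravel())
    X = np.stack(feats)

    # Plain k-means on the spectral features yields the pseudo-labels.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=n_domains, replace=False)]
    for _ in range(n_iters):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for k in range(n_domains):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels
```

Such pseudo-labels could then gate the learnable spatial masks mentioned in the abstract, so each synthetic image mixes content from multiple pseudo-domains while the Images Per Class budget stays unchanged.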
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 6504