Keywords: Long Tailed Data, Data Synthesis
TL;DR: We propose a scalable and efficient long-tailed dataset distillation method via energy loss matching.
Abstract: Dataset distillation (DD) compresses large datasets into compact synthetic sets for efficient model training. Among DD approaches, distribution matching (DM) has emerged as a promising direction because it bypasses the computational cost of bi-level optimization while maintaining strong performance. However, most current DM approaches adopt metrics that are not theoretically well-founded and thus fail to accurately capture distributional discrepancies, which stems from a lack of in-depth theoretical analysis of the metrics themselves. We therefore revisit existing metrics from the spectral domain and provide theoretical insights to guide future metric design. Based on this analysis, we propose Spectral Distribution Matching (SDM), a Fourier-based approach that introduces theoretically motivated, discriminative metrics and computes them with linear complexity, enabling fast and scalable computation. SDM is effective on standard datasets and also demonstrates superior performance on more challenging long-tailed datasets. To address the class imbalance caused by long-tailed data distributions, we leverage the unified metric formulation of SDM to further propose Class-Aware Spectral Distribution Matching (CSDM), which adaptively balances amplitude and phase information based on class imbalance, enhancing the realism of head classes while preserving the diversity of tail classes. Overall, SDM and CSDM provide a principled rethinking of distribution matching from the spectral perspective and introduce a class-aware mechanism that addresses the often-overlooked challenge of long-tailed distributions in dataset distillation. By bridging theoretical insights with algorithmic efficiency, our methods consistently deliver strong performance across both standard and long-tailed benchmarks.
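Illustrative sketch (not the authors' implementation): the abstract does not define the SDM metric, so the Python snippet below only shows one generic way to compare two feature sets in the Fourier domain, using empirical characteristic functions evaluated at a fixed set of frequencies. The function names, the `alpha` weighting between amplitude and phase terms, and the random frequency sampling are all assumptions made for illustration. The cost is linear in the number of samples, since each set is summarized independently without pairwise comparisons.

```python
import numpy as np

def empirical_cf(x, freqs):
    """Empirical characteristic function of samples x (n, d) at freqs (k, d)."""
    proj = x @ freqs.T                      # (n, k) projections <t_j, x_i>
    return np.exp(1j * proj).mean(axis=0)   # (k,) complex values, O(n * k * d)

def spectral_discrepancy(real, syn, freqs, alpha=0.5):
    """Hypothetical amplitude/phase discrepancy between two sample sets.

    alpha weights the amplitude term and (1 - alpha) the phase term.
    This weighting is an assumption, not the rule used by SDM or CSDM.
    """
    cf_r = empirical_cf(real, freqs)
    cf_s = empirical_cf(syn, freqs)
    amp_r, amp_s = np.abs(cf_r), np.abs(cf_s)
    # Amplitude mismatch at each frequency.
    amp_term = (amp_r - amp_s) ** 2
    # Phase mismatch, scaled by both amplitudes so that
    # amp_term + phase_term == |cf_r - cf_s|**2 at each frequency.
    phase_diff = np.angle(cf_r) - np.angle(cf_s)
    phase_term = 2.0 * amp_r * amp_s * (1.0 - np.cos(phase_diff))
    return float(np.mean(alpha * amp_term + (1.0 - alpha) * phase_term))

rng = np.random.default_rng(0)
real = rng.normal(size=(200, 4))       # stand-in for real-data features
syn = real + 1.0                       # stand-in for shifted synthetic features
freqs = rng.normal(size=(16, 4))       # randomly sampled frequencies (assumption)

d_same = spectral_discrepancy(real, real, freqs)
d_shift = spectral_discrepancy(real, syn, freqs)
```

With `alpha=0.5` the score reduces to half the mean squared distance between the two characteristic functions; moving `alpha` away from 0.5 shifts the emphasis toward amplitude or phase, which is the kind of knob a class-aware scheme could adjust per class.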
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 1965