Rethinking Dataset Quantization: Efficient Coreset Selection via Semantically-Aware Data Augmentation
Abstract: Coreset selection aims to reduce the computational burden of training large-scale deep learning models by identifying representative subsets from massive datasets. However, existing state-of-the-art methods face a fundamental accessibility dilemma: they either require extensive training on the target dataset to compute selection metrics, or depend heavily on large pre-trained models, undermining the core purpose of coreset selection in resource-constrained scenarios. Dataset Quantization (DQ) avoids full dataset training but relies on expensive pre-trained models, introducing computational overhead and domain-specific biases that limit generalization. In this work, we comprehensively redesign the DQ framework to establish a more accessible and domain-robust paradigm for coreset selection. Through rigorous analysis, we identify that: (1) MAE functions primarily as biased data augmentation leveraging memorized ImageNet patterns; (2) MAE benefits ImageNet-related datasets but harms out-of-distribution performance; (3) the original pipeline suffers from feature inconsistency between selection and training phases. We propose DQ_v2, which: (1) eliminates pre-trained model dependencies via Semantically-Aware Data Augmentation (SDA) using randomly initialized CNNs; (2) restructures the pipeline by performing augmentation before selection, ensuring feature consistency. Extensive experiments demonstrate that DQ_v2 achieves superior performance across diverse domains (such as ImageNet-1k, CUB-200, Food-101, and medical imaging) while reducing end-to-end coreset construction cost by 41% on ImageNet-1k (95% in the augmentation phase alone), establishing a robust and practical solution for resource-constrained scenarios.
Submission Type: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We address all changes requested by the three reviewers. Key revisions include:
(1) new ablation experiments on background types, augmentation ratios, patch granularity, and component ablation (Appendix B);
(2) an end-to-end computational cost breakdown table (Table 1);
(3) Algorithm 1 detailing the mask generation procedure;
(4) a reframing of MAE's behavior as local texture interpolation, with explicit attribution to Cao & Wu (2022);
(5) a scope statement clarifying applicability to single-object classification;
(6) weakened overclaims throughout.
All changes are marked in blue.
Assigned Action Editor: ~Han-Jia_Ye1
Submission Number: 7034