Rethinking Dataset Quantization: Efficient Coreset Selection via Semantically-Aware Data Augmentation

TMLR Paper7034 Authors

16 Jan 2026 (modified: 18 Jan 2026) · Under review for TMLR · CC BY 4.0
Abstract: Coreset selection aims to reduce the computational burden of training large-scale deep learning models by identifying representative subsets of massive datasets. However, existing state-of-the-art methods face a fundamental accessibility dilemma: they either require extensive training on the target dataset to compute selection metrics, or depend heavily on large pre-trained models, undermining the core purpose of coreset selection in resource-constrained scenarios. Dataset Quantization (DQ) avoids full-dataset training but relies on expensive pre-trained models, introducing computational overhead and domain-specific biases that limit generalization. In this work, we comprehensively redesign the DQ framework to establish a truly accessible, theoretically sound, and domain-agnostic paradigm for coreset selection. Through rigorous analysis, we identify that: (1) the MAE (Masked Autoencoder) functions primarily as a biased form of data augmentation that exploits memorized ImageNet patterns; (2) the MAE benefits ImageNet-related datasets but harms out-of-distribution performance; (3) the original pipeline suffers from feature inconsistency between the selection and training phases. We propose DQ_v2, which: (1) eliminates pre-trained model dependencies via Semantically-Aware Data Augmentation (SDA) using randomly initialized CNNs; and (2) restructures the pipeline to perform augmentation before selection, ensuring feature consistency. Extensive experiments demonstrate that DQ_v2 achieves superior performance across diverse domains (ImageNet-1k, CUB-200, Food-101, and medical imaging) while reducing the computational cost of the augmentation phase by 75%, establishing a robust and practical solution for resource-constrained scenarios.
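A minimal sketch of the pipeline order the abstract describes: semantic augmentation driven by a randomly initialized CNN runs first, and coreset selection operates on the augmented pool so that selection and training see consistent features. The SDA internals shown here (activation-based saliency crops) and all function and variable names are illustrative assumptions, not the paper's actual procedure.

```python
import torch
import torch.nn as nn

def sda_crops(images, cnn, crop=64, crops_per_image=2):
    """Return crops centered on high-activation regions of a randomly
    initialized CNN. The channel-averaged activation magnitude is used
    as a saliency proxy (an assumption, not the paper's exact SDA)."""
    N, _, H, W = images.shape
    with torch.no_grad():
        feats = cnn(images)                  # (N, C, h, w)
    sal = feats.abs().mean(dim=1)            # (N, h, w) saliency proxy
    h, w = sal.shape[1:]
    out = []
    for i in range(N):
        top = torch.topk(sal[i].flatten(), crops_per_image).indices
        for idx in top:
            cy = int(idx // w) * H // h      # map feature -> pixel coords
            cx = int(idx % w) * W // w
            y0 = max(0, min(H - crop, cy - crop // 2))
            x0 = max(0, min(W - crop, cx - crop // 2))
            out.append(images[i, :, y0:y0 + crop, x0:x0 + crop])
    return torch.stack(out)

# Randomly initialized CNN: no pre-trained weights, per the abstract.
cnn = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
)

# DQ_v2 pipeline order: augment the full dataset FIRST, then run coreset
# selection on the augmented pool (selection stage not shown here).
images = torch.rand(8, 3, 224, 224)          # stand-in batch
augmented = sda_crops(images, cnn)
```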
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Han-Jia_Ye1
Submission Number: 7034