Keywords: Efficient Deep Learning, Sustainable Deep Learning, Data-centric, Unlabeled Data
TL;DR: We pioneer a novel data-centric perspective for collaborative unlabeled data optimization, demonstrating that the optimized data can achieve strong efficacy, efficiency, and reusability across various datasets and architectures.
Abstract: This paper pioneers a \textit{novel data-centric paradigm} to maximize the utility of unlabeled data, tackling a critical question: \emph{How can we enhance the sustainability and efficiency of deep learning training by optimizing the data itself?}
We begin by identifying two key limitations of existing model-centric approaches, both rooted in a shared bottleneck: knowledge extracted from data is locked into model parameters, hindering its reusability and scalability.
To address this bottleneck, we propose \algopt, a highly efficient, parallelized framework for collaborative unlabeled data optimization.
By distributing unlabeled data and leveraging publicly available task-agnostic prior models, \algopt optimizes raw unlabeled data into knowledge-enriched training sets that are effective, efficient, reusable, and easily shareable.
Extensive experiments across diverse datasets and architectures validate these advantages, achieving a 7.9\% improvement on ImageNet-1K over BYOL.
Notably, \algopt remains effective even when all prior models are significantly weak, substantially accelerating the early stages of training. These results establish data-centric optimization as a promising path toward sustainable and efficient deep learning. Our code is provided in the Supplementary Materials and will be made publicly accessible.
Supplementary Material: zip
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 4346