Beyond Model-Centric: Collaborative Data Optimization for Reusing and Sharing

12 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Efficient Deep Learning, Sustainable Deep Learning, Data-centric, Unlabeled Data
TL;DR: We introduce a novel data-centric perspective on collaborative unlabeled data optimization, demonstrating that the optimized data achieves strong efficacy, efficiency, and reusability across diverse datasets and architectures.
Abstract: This paper pioneers a \textit{novel data-centric paradigm} to maximize the utility of unlabeled data, tackling a critical question: \emph{How can we enhance the sustainability and efficiency of deep learning training by optimizing the data itself?} We begin by identifying two key limitations of existing model-centric approaches, both rooted in a shared bottleneck: knowledge extracted from data is locked into model parameters, hindering its reusability and scalability. To address this, we propose \algopt, a highly efficient, parallelized framework for collaborative unlabeled data optimization. By distributing unlabeled data and leveraging publicly available, task-agnostic prior models, \algopt optimizes raw unlabeled data into knowledge-enriched training sets that are effective, efficient, reusable, and easily shareable. Extensive experiments across diverse datasets and architectures validate these advantages, including a 7.9\% improvement over BYOL on ImageNet-1K. Notably, \algopt remains effective even when all prior models are weak, substantially accelerating the early stages of training. These results establish data-centric optimization as a promising path toward sustainable and efficient deep learning. Our code is provided in the Supplementary Materials and will be publicly accessible.
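For intuition, the following is a minimal, hypothetical Python sketch of the collaborative setup the abstract describes: unlabeled data is split into shards, each shard is enriched independently by a frozen prior model (here approximated by soft pseudo-labeling), and the results are merged into a single shareable training set. All names (`optimize_shard`, the toy priors) and the pseudo-labeling step are illustrative assumptions, not the paper's actual \algopt method.

```python
# Hypothetical sketch of collaborative unlabeled-data optimization.
# The function names, toy prior models, and pseudo-labeling step are
# assumptions for illustration; they are not the paper's actual API.
import torch
import torch.nn as nn


def optimize_shard(shard: torch.Tensor, prior: nn.Module) -> dict:
    """Enrich one shard of unlabeled data with a frozen prior model's
    knowledge (here: soft pseudo-labels); runs independently per worker,
    so shards can be processed in parallel."""
    prior.eval()
    with torch.no_grad():
        logits = prior(shard)                 # task-agnostic predictions
        soft_labels = logits.softmax(dim=-1)  # knowledge attached to the data
    return {"inputs": shard, "targets": soft_labels}


# --- toy setup: stand-ins for public prior models and unlabeled images ---
priors = [nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
          for _ in range(4)]                  # 4 (possibly weak) prior models
unlabeled = torch.randn(4, 64, 3, 32, 32)     # 4 shards of 64 "images" each

# Each shard is optimized against its own prior model, then merged into a
# single knowledge-enriched training set that downstream users can reuse
# without access to the original prior models.
enriched = [optimize_shard(shard, prior)
            for shard, prior in zip(unlabeled, priors)]
train_x = torch.cat([e["inputs"] for e in enriched])
train_y = torch.cat([e["targets"] for e in enriched])
print(train_x.shape, train_y.shape)           # shared, model-free dataset
```

The key design point the sketch tries to convey is that knowledge ends up stored in the dataset (inputs paired with enriched targets) rather than in any single model's parameters, which is what makes it reusable across architectures.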
Supplementary Material: zip
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 4346