Beyond Model-Centric: Collaborative Data Optimization for Reusing and Sharing

12 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Efficient Deep Learning, Sustainable Deep Learning, Data-centric, Unlabeled Data
TL;DR: We introduce a novel data-centric perspective on collaborative unlabeled data optimization, demonstrating that the optimized data achieves strong efficacy, efficiency, and reusability across diverse datasets and architectures.
Abstract: This paper pioneers a \textit{novel data-centric paradigm} to maximize the utility of unlabeled data, tackling a critical question: \emph{How can we enhance the sustainability and efficiency of deep learning training by optimizing the data itself?} We begin by identifying two key limitations of existing model-centric approaches, both rooted in a shared bottleneck: knowledge extracted from data is locked into model parameters, hindering its reusability and scalability. To address this, we propose \algopt, a highly efficient, parallelized framework for collaborative unlabeled data optimization. By distributing unlabeled data and leveraging publicly available, task-agnostic prior models, \algopt optimizes raw unlabeled data into knowledge-enriched training sets that are effective, efficient, reusable, and easily shareable. Extensive experiments across diverse datasets and architectures validate these advantages, including a 7.9\% improvement over BYOL on ImageNet-1K. Notably, \algopt remains effective even when all prior models are weak, substantially accelerating the early stages of training. These results establish data-centric optimization as a promising path toward sustainable and efficient deep learning. Our code is provided in the Supplementary Materials and will be publicly accessible.
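For intuition, the following is a minimal, hypothetical Python sketch of the collaborative setup the abstract describes: unlabeled data is split into shards, each shard is enriched independently by a frozen prior model (here approximated by soft pseudo-labeling), and the results are merged into a single shareable training set. All names (`optimize_shard`, the toy priors) and the pseudo-labeling step are illustrative assumptions, not the paper's actual \algopt method.

```python
# Hypothetical sketch of collaborative unlabeled-data optimization.
# The function names, toy prior models, and pseudo-labeling step are
# assumptions for illustration; they are not the paper's actual API.
import torch
import torch.nn as nn


def optimize_shard(shard: torch.Tensor, prior: nn.Module) -> dict:
    """Enrich one shard of unlabeled data with a frozen prior model's
    knowledge (here: soft pseudo-labels); runs independently per worker,
    so shards can be processed in parallel."""
    prior.eval()
    with torch.no_grad():
        logits = prior(shard)                 # task-agnostic predictions
        soft_labels = logits.softmax(dim=-1)  # knowledge attached to the data
    return {"inputs": shard, "targets": soft_labels}


# --- toy setup: stand-ins for public prior models and unlabeled images ---
priors = [nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
          for _ in range(4)]                  # 4 (possibly weak) prior models
unlabeled = torch.randn(4, 64, 3, 32, 32)     # 4 shards of 64 "images" each

# Each shard is optimized against its own prior model, then merged into a
# single knowledge-enriched training set that downstream users can reuse
# without access to the original prior models.
enriched = [optimize_shard(shard, prior)
            for shard, prior in zip(unlabeled, priors)]
train_x = torch.cat([e["inputs"] for e in enriched])
train_y = torch.cat([e["targets"] for e in enriched])
print(train_x.shape, train_y.shape)           # shared, model-free dataset
```

The key design point the sketch tries to convey is that knowledge ends up stored in the dataset (inputs paired with enriched targets) rather than in any single model's parameters, which is what makes it reusable across architectures.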
Supplementary Material: zip
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 4346