What Data-Centric AI Can Do For k-means: a Faster, Robust k-means-d

Parichit Sharma, Hasan Kurban, Mehmet Dalkilic

Published: 26 Jul 2024, Last Modified: 13 Dec 2024. Venue: Data-centric Machine Learning Research (DMLR): Datasets for Foundation Models Workshop, Proceedings of the 41st International Conference on Machine Learning (ICML 2024), Vienna. License: CC BY-NC 4.0

Abstract: Data-centric AI (DCAI) is an emerging paradigm that prioritizes the quality, diversity, and representation of data over model architecture and hyperparameter tuning. DCAI emphasizes upstream data operations such as cleaning, balancing, and preprocessing, rather than focusing solely on downstream model selection and optimization. This work aims to push DCAI into the model-building phase itself, examining whether the downstream benefits can be as significant in a classical, well-studied algorithm like k-means. We introduce data-centric k-means (k-means-d for short), a novel adaptation of k-means clustering that achieves significant speedups while preserving algorithmic accuracy. The key innovation is to classify each data point as either high expressive (HE), meaning it impacts the objective function significantly, or low expressive (LE), meaning it has minimal influence. By categorizing data points as HE/LE, k-means-d extracts quality signals from data to improve scalability and reduce computational overhead. Comprehensive experimental evaluations demonstrate substantial performance gains of k-means-d over existing alternatives. The novelty lies in integrating data-centric principles within the iterative core of a fundamental algorithm. By rethinking k-means through a data lens, k-means-d delivers superior efficiency without sacrificing properties like accuracy and convergence, paving the way for infusing data-centricity into other canonical algorithms.
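To make the HE/LE idea concrete, the following is a minimal illustrative sketch, not the authors' method. The abstract does not state how points are classified, so the sketch assumes a hypothetical margin rule: a point counts as HE when the gap between its nearest and second-nearest centroid distances falls below a threshold, and as LE otherwise. The names `split_he_le`, `kmeans_he_only`, and `margin_threshold` are invented for illustration. After the initial full assignment, only HE points are re-evaluated in each iteration, which is where any savings would come from.

```python
import numpy as np


def split_he_le(X, centroids, margin_threshold):
    """Assign points to centroids and flag HE points (hypothetical rule).

    Assumption: a point is HE if the difference between its
    second-nearest and nearest centroid distances is below
    margin_threshold. Requires at least two centroids.
    """
    # Pairwise Euclidean distances, shape (n_points, k).
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # After partitioning at index 1, columns 0 and 1 hold the
    # smallest and second-smallest distance of each row.
    two_nearest = np.partition(dists, 1, axis=1)[:, :2]
    margin = two_nearest[:, 1] - two_nearest[:, 0]
    he_mask = margin < margin_threshold
    return labels, he_mask


def kmeans_he_only(X, k, margin_threshold, n_iter=20, seed=0):
    """Lloyd-style loop that re-evaluates only HE points (a heuristic)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    # One full pass to get initial labels and the HE/LE split.
    labels, he_mask = split_he_le(X, centroids, margin_threshold)
    for _ in range(n_iter):
        # Centroid update uses all points, HE and LE alike.
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
        he_idx = np.flatnonzero(he_mask)
        if he_idx.size == 0:
            break
        # Distances are recomputed for HE points only; LE points keep
        # their labels and are never revisited in this sketch.
        new_labels, new_he = split_he_le(X[he_idx], centroids, margin_threshold)
        changed = np.any(new_labels != labels[he_idx])
        labels[he_idx] = new_labels
        he_mask[he_idx] = new_he
        if not changed:
            break
    return labels, centroids, he_mask
```

With `margin_threshold=np.inf` every point is HE and the loop reduces to ordinary Lloyd iterations; smaller thresholds skip more distance computations but, under this assumed rule, may yield a different clustering than standard k-means.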