Uniform and Non-uniform Sampling Methods for Sub-linear Time k-means Clustering

Yuanhang Ren, Ye Du

Published: 01 Jan 2020, Last Modified: 14 May 2023, ICPR 2020, Readers: Everyone

Abstract: The k-means problem is arguably the most well-known clustering problem in machine learning, and many approximation algorithms have been proposed for it. However, many of these algorithms become infeasible when the data set is huge. Sub-linear time algorithms with constant approximation ratios are desirable in this scenario. In this paper, we first improve the analysis of the algorithm proposed by [1], sharpening the approximation ratio from 4(α + β) to α + β. Moreover, under mild assumptions on the data, the algorithm achieves a constant approximation ratio in poly-logarithmic time. Furthermore, we propose a novel sub-linear time clustering algorithm called Double-K-MC2 sampling. Experiments on data clustering and image segmentation tasks validate the effectiveness of our algorithms.
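
To make the idea of sub-linear time clustering by uniform sampling concrete, here is a minimal Python sketch. It is not the algorithm of [1] and not Double-K-MC2 sampling, whose details are not given in this abstract. It assumes a generic scheme: draw m points uniformly at random, run plain Lloyd iterations on the sample only, and use the resulting centers for the full data set. The names `uniform_sample_kmeans`, `lloyd`, `kmeans_cost`, and the parameters `m`, `iters`, and `seed` are hypothetical and chosen for illustration.

```python
import numpy as np


def kmeans_cost(X, centers):
    """Sum of squared distances from each point to its nearest center."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()


def lloyd(X, k, rng, iters=20):
    """Plain Lloyd iterations, initialized from k distinct random points of X."""
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return centers


def uniform_sample_kmeans(X, k, m, seed=0):
    """Cluster a uniform sample of m points (assumed scheme, not the paper's).

    Only the m sampled rows are used for clustering, so the clustering work
    depends on m, k, and the dimension rather than on len(X).
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=m, replace=False)
    return lloyd(X[idx], k, rng)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Synthetic data: three Gaussian blobs in the plane (illustrative only).
    X = np.concatenate([
        rng.normal(loc=(0.0, 0.0), scale=0.5, size=(1000, 2)),
        rng.normal(loc=(5.0, 5.0), scale=0.5, size=(1000, 2)),
        rng.normal(loc=(0.0, 5.0), scale=0.5, size=(1000, 2)),
    ])
    centers = uniform_sample_kmeans(X, k=3, m=200, seed=0)
    # Evaluating the cost touches all of X; it is a check, not part of the method.
    print("centers:", centers)
    print("cost on full data:", kmeans_cost(X, centers))
```

The sample size m controls the trade-off in this sketch: a larger m makes the sample more representative of the full data, while a smaller m reduces the clustering time. How m must scale to guarantee a given approximation ratio is the kind of question the paper's analysis addresses and is not captured here.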
