A Beyond-Worst-Case Analysis of Greedy k-means++

Qingyun Chen; Sungjin Im; Benjamin Moseley; Ryan Milstrey; Chenyang Xu; Ruilong Zhang

A Beyond-Worst-Case Analysis of Greedy k-means++

Qingyun Chen, Sungjin Im, Benjamin Moseley, Ryan Milstrey, Chenyang Xu, Ruilong Zhang

Published: 18 Sept 2025, Last Modified: 29 Oct 2025NeurIPS 2025 posterEveryoneRevisionsBibTeXCC BY-NC 4.0

Keywords: k-means, clustering, approximation algorithms

Abstract: $k$-means++ and the related greedy $k$-means++ algorithm are celebrated algorithms that efficiently compute seeds for Lloyd's algorithm. Greedy $k$-means++ is a generalization of $k$-means++ where, in each iteration, a new seed is greedily chosen among multiple $\ell \geq 2$ points sampled, as opposed to a single seed being sampled in $k$-means++. While empirical studies consistently show the superior performance of greedy $k$-means++, making it a preferred method in practice, a discrepancy exists between theory and practice. No theoretical justification currently explains this improved performance. Indeed, the prevailing theory suggests that greedy $k$-means++ exhibits worse performance than $k$-means++ in worst-case scenarios. This paper presents an analysis demonstrating the outperformance of the greedy algorithm compared to $k$-means++ for a natural class of well-separated instances with exponentially decaying distributions, such as Gaussian, specifically when $\ell = \Theta(\log k)$, a common parameter setting in practical applications.

Supplementary Material: zip

Primary Area: Optimization (e.g., convex and non-convex, stochastic, robust)

Submission Number: 21728

Loading