Do you know what k-means? Clustering with constant number of samples

ICLR 2026 Conference Submission18893 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0

Keywords: k-means, clustering, quantum algorithms, unsupervised learning

TL;DR: We propose approximate classical and quantum versions of Lloyd's k-means algorithm that require only a constant number of samples

Abstract: Clustering is one of the most important tools for the analysis of large datasets, and perhaps the most popular clustering algorithm is Lloyd's algorithm for $k$-means. This algorithm takes $n$ vectors $V=[v_1,\dots,v_n]\in\mathbb{R}^{d\times n}$ and outputs $k$ centroids $c_1,\dots,c_k\in\mathbb{R}^d$; these partition the vectors into clusters according to which centroid is closest to each vector. We present a classical $\varepsilon$-$k$-means algorithm that performs an approximate version of one iteration of Lloyd's algorithm with time complexity $\widetilde{O}\big(\frac{\|V\|_F^2}{n}\frac{k^{2}d}{\varepsilon^2}(k + \log{n})\big)$, exponentially improving the dependence on the data size $n$ and matching that of the "$q$-means" quantum algorithm originally proposed by Kerenidis, Landman, Luongo, and Prakash (NeurIPS'19). Moreover, we propose an improved $q$-means quantum algorithm with time complexity $\widetilde{O}\big(\frac{\|V\|_F}{\sqrt{n}}\frac{k^{3/2}d}{\varepsilon}(\sqrt{k}+\sqrt{d})(\sqrt{k} + \log{n})\big)$ that quadratically improves the runtime of our classical $\varepsilon$-$k$-means algorithm in several parameters. Our quantum algorithm does not rely on the quantum linear algebra primitives of prior work, but instead only uses QRAM to prepare simple states based on the current iteration's clusters and multivariate quantum mean estimation. Our upper bounds are complemented with classical and quantum query lower bounds, showing that our algorithms are optimal in most parameters. Finally, we conduct numerical experiments that evidence the substantially improved runtime of our classical algorithm over the standard Lloyd's algorithm, making it one of the first cases of a practical dequantised algorithm.
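
To make the idea of an approximate Lloyd iteration concrete, the following is a minimal Python sketch, not the authors' algorithm. It contrasts one exact Lloyd iteration with one that updates the centroids from a fixed-size subsample. The function names `lloyd_step` and `sampled_lloyd_step`, the `num_samples` parameter, and the use of uniform sampling without replacement are all assumptions made for illustration; the abstract does not specify the sampling scheme, and the $\|V\|_F^2/n$ factor in the stated bound suggests the actual method may differ.

```python
import numpy as np


def lloyd_step(V, centroids):
    """One exact Lloyd iteration.

    V has shape (d, n), one vector per column, as in the abstract.
    centroids has shape (d, k). Returns (new_centroids, labels).
    """
    # Squared Euclidean distance from every centroid to every vector: (k, n).
    diff = V[:, None, :] - centroids[:, :, None]
    sq_dist = (diff ** 2).sum(axis=0)
    # Assign each vector to its closest centroid.
    labels = sq_dist.argmin(axis=0)
    new_centroids = centroids.copy()
    for j in range(centroids.shape[1]):
        members = V[:, labels == j]
        # An empty cluster keeps its previous centroid.
        if members.shape[1] > 0:
            new_centroids[:, j] = members.mean(axis=1)
    return new_centroids, labels


def sampled_lloyd_step(V, centroids, num_samples, rng):
    """Approximate Lloyd iteration on a uniform subsample (illustrative assumption)."""
    n = V.shape[1]
    idx = rng.choice(n, size=min(num_samples, n), replace=False)
    # The cost of this update depends on num_samples rather than on n.
    new_centroids, _ = lloyd_step(V[:, idx], centroids)
    return new_centroids
```

In this sketch the per-iteration cost of `sampled_lloyd_step` scales with `num_samples`, `k`, and `d` instead of `n`, which is the qualitative behaviour the abstract describes; how many samples are needed for an $\varepsilon$-accurate update is exactly what the paper's bounds address.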

Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning

Submission Number: 18893
