Keywords: clustering, PTAS
TL;DR: We propose a polynomial-time approximation scheme for (k, z)-clustering (a generalization of k-means) on heavily skewed distributions, e.g., Zipfian distributions.
Abstract: In this paper, we tackle the problem of $(k,z)$-clustering, a generalization of the well-known $k$-means, $k$-medians and $k$-medoids problems. The problem is known to be APX-hard, i.e., impossible to approximate within a multiplicative factor of $1.06$ in time polynomial in $n$ and $k$ unless P=NP. Due to this APX-hardness, the fastest $(1+\varepsilon)$-approximation scheme, proposed by Feldman et al. (2007), has a run time with a polynomial dependency on $n$ but an exponential dependency $2^{\tilde{\mathcal{O}}(k/\varepsilon)}$ on $k$. We observe that a $(1+\varepsilon)$-approximation in truly polynomial time is feasible if the data set exhibits a sufficiently skewed distribution. Indeed, in practical scenarios, data sets are often heavily skewed, so that the overall clustering cost is disproportionately dominated by a few clusters. We propose a novel algorithm that adapts the traditional local search technique to handle $(s, 1- \varepsilon^{z+1})$-skewed data sets, with a run time of $(nk/\varepsilon)^{\mathcal{O}(s+1/\varepsilon)}$ for the discrete case and $\tilde{\mathcal{O}}(nk) + (k \log n)^{\tilde{\mathcal{O}}(s+1/\varepsilon)}$ for the continuous case. Our method is particularly effective for Zipfian distributions with exponent $p>1$, where $s = \mathcal{O}\left(\frac{1}{\varepsilon^{(z+1)/(p-1)}}\right)$.
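The quantities in the abstract can be illustrated with a small sketch. It is not the paper's algorithm. It only evaluates the $(k,z)$-clustering cost and, under the assumption that "$(s, 1-\varepsilon^{z+1})$-skewed" means the $s$ most expensive clusters carry at least a $1-\varepsilon^{z+1}$ fraction of the total cost, finds the smallest such $s$ for Zipfian per-cluster costs proportional to $i^{-p}$. The function names and the Zipfian cost model are hypothetical and chosen for illustration; the abstract does not state the formal definition.

```python
import math


def kz_cost(points, centers, z):
    """(k, z)-clustering cost: sum of (distance to nearest center) ** z."""
    return sum(min(math.dist(p, c) for c in centers) ** z for p in points)


def smallest_dominating_prefix(cluster_costs, fraction):
    """Smallest s such that the s most expensive clusters carry at least
    `fraction` of the total cost (assumed reading of the skewness condition)."""
    total = sum(cluster_costs)
    running = 0.0
    for s, cost in enumerate(sorted(cluster_costs, reverse=True), start=1):
        running += cost
        if running >= fraction * total:
            return s
    return len(cluster_costs)


def zipfian_costs(k, p):
    """Hypothetical per-cluster costs following a Zipfian law: cost_i = i ** (-p)."""
    return [i ** (-p) for i in range(1, k + 1)]


if __name__ == "__main__":
    eps, z, k = 0.5, 2, 1000
    fraction = 1 - eps ** (z + 1)
    for p in (1.5, 2.0, 3.0):
        s = smallest_dominating_prefix(zipfian_costs(k, p), fraction)
        print(f"p={p}: s={s}")
```

Under this reading, a larger exponent $p$ concentrates the cost in fewer clusters, so $s$ shrinks, which is consistent with the bound $s = \mathcal{O}\left(\varepsilon^{-(z+1)/(p-1)}\right)$ quoted in the abstract.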
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5239