On Optimal Coreset Construction for Euclidean (k, z)-Clustering

Lingxiao Huang, Jian Li, Xuan Wu

Published: 01 Jan 2024, Last Modified: 15 Apr 2025STOC 2024EveryoneRevisionsBibTeXCC BY-SA 4.0

Abstract: Constructing small-sized coresets for various clustering problems in different metric spaces has attracted significant attention for the past decade. A central problem in the coreset literature is to understand what is the best possible coreset size for (k,z)-clustering in Euclidean space. While there has been significant progress in the problem, there is still a gap between the state-of-the-art upper and lower bounds. For instance, the best known upper bound for k-means (z=2) is min{O(k3/2 ε−2),O(k ε−4)} [Cohen-Addad, Larsen, Saulpic, Schwiegelshohn, Sheikh-Omar, NeurIPS’22], while the best known lower bound is Ω(kε−2) [Cohen-Addad, Larsen, Saulpic, Schwiegelshohn. STOC’22]. In this paper, we make significant progress on both upper and lower bounds. For a large range of parameters (i.e., ε, k), we have a complete understanding of the optimal coreset size. In particular, we obtain the following results: (1) We present a new coreset lower bound Ω(k ε−z−2) for Euclidean (k,z)-clustering when ε ≥ Ω(k−1/(z+2)). In view of the prior upper bound Õz(k ε−z−2) [Cohen-Addad, Larsen, Saulpic, Schwiegelshohn. STOC’22], the bound is optimal. The new lower bound is surprising since Ω(kε−2) [Cohen-Addad, Larsen, Saulpic, Schwiegelshohn. STOC’22] is “conjectured” to be the correct bound in some recent works (see e.g., [Cohen-Addad, Larsen, Saulpic, Schwiegelshohn. STOC’22; Cohen-Addad, Larsen, Saulpic, Schwiegelshohn, Sheikh-Omar, NeurIPS’22]). Our new lower bound instance is a delicate construction with multiple clusters of points, which is a significant departure from the previous construction in [Cohen-Addad, Larsen, Saulpic, Schwiegelshohn. STOC’22] that contains a single cluster of points. The new lower bound also implies improved lower bounds for (k,z)-clustering in doubling metrics. (2) For the upper bound, we provide efficient coreset construction algorithms for (k,z)-clustering with improved or optimal coreset sizes in several metric spaces. In particular, we provide an Õz(k2z+2/z+2 ε−2)-sized coreset, with a unfied analysis, for (k,z)-clustering for all z≥ 1 in Euclidean space. This upper bound improves upon the Õz(k2ε−2) upper bound by [Cohen-Addad, Larsen, Saulpic, Schwiegelshohn. STOC’22] (when k≤ ε−1), and matches the recent independent results [Cohen-Addad, Larsen, Saulpic, Schwiegelshohn, Sheikh-Omar, NeurIPS’22] for k-median and k-means (z=1,2) and extends them to all z≥ 1.