Coresets for Clustering with Noisy Data

22 Sept 2023 (modified: 11 Feb 2024), Submitted to ICLR 2024
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: clustering, noise, coreset
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We study the problem of data reduction for clustering in the presence of stochastic noise, propose a measure that better captures the quality of a coreset in this setting, and show its effectiveness theoretically and empirically.
Abstract: We study the problem of data reduction for clustering when the input dataset $\widehat{P}$ is a noisy version of the true dataset $P$. The problem is motivated by settings where data is obtained from inherently noisy measurements, or where noise is added to data for privacy or robustness reasons. In the noise-free setting, coresets have been proposed as a solution to this data reduction problem: a coreset is a subset $S$ of $P$ that comes with a guarantee that the maximum difference, over all center sets, between the cost of a center set on $S$ and its cost on $P$ is small. We find that this well-studied measure of coreset quality is too strong when the data is noisy, because in the case $S=\widehat{P}$ the change in the cost of the optimal center set, compared to its cost on $P$, can be much smaller than the change for other center sets. To bypass this, we consider a modification of this measure that 1) restricts attention to approximately optimal center sets and 2) considers the *ratio* of the cost of $S$ for a given center set to the minimum cost of $S$ over all approximately optimal center sets. This new measure allows us to obtain refined estimates of the quality of the optimal center set of a coreset as a function of the noise level. Our results apply to a wide class of noise models satisfying certain bounded-moment conditions, including Gaussian and Laplace distributions. They are not algorithm-dependent and can be used to derive estimates of the quality of a coreset produced by any algorithm in the noisy setting. Empirically, we evaluate coresets obtained from noisy versions of real-world datasets; the results support our theoretical findings and suggest that the variance of the noise is the main factor determining coreset performance.
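The two quantities contrasted in the abstract can be illustrated with a minimal sketch. It is not the authors' code or their formal definition. It assumes the k-means objective (sum of squared distances to the nearest center), additive Gaussian noise, and a small finite list of candidate center sets standing in for the approximately optimal center sets. The helper names `kmeans_cost`, `relative_cost_error`, and `approximation_ratios` are hypothetical.

```python
import numpy as np

def kmeans_cost(points, centers, weights=None):
    """Sum of squared distances from each point to its nearest center.

    Assumption: the clustering objective is k-means; optional per-point
    weights allow weighted coresets.
    """
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.min(axis=1)
    if weights is None:
        return float(nearest.sum())
    return float((np.asarray(weights) * nearest).sum())

def relative_cost_error(P, S, centers, weights=None):
    """Classical-style error for one center set: |cost(S,C) - cost(P,C)| / cost(P,C)."""
    cost_P = kmeans_cost(P, centers)
    cost_S = kmeans_cost(S, centers, weights)
    return abs(cost_S - cost_P) / cost_P

def approximation_ratios(X, candidates, weights=None):
    """Cost of X for each candidate center set, divided by the minimum cost
    over the candidates (hypothetical finite stand-in for the set of
    approximately optimal center sets)."""
    costs = np.array([kmeans_cost(X, C, weights) for C in candidates])
    return costs / costs.min()

# Synthetic data (illustrative only): two clusters plus Gaussian noise.
rng = np.random.default_rng(0)
true_centers = np.array([[0.0, 0.0], [10.0, 10.0]])
P = np.vstack([c + 0.5 * rng.standard_normal((200, 2)) for c in true_centers])
P_hat = P + 1.0 * rng.standard_normal(P.shape)  # noisy dataset, here S = P_hat

candidates = [
    true_centers,
    true_centers + 0.5,                          # slightly shifted centers
    np.array([[-20.0, -20.0], [30.0, 30.0]]),    # far-from-optimal centers
]

for C in candidates:
    print("relative cost error:", relative_cost_error(P, P_hat, C))

print("ratios on P:    ", approximation_ratios(P, candidates))
print("ratios on P_hat:", approximation_ratios(P_hat, candidates))
```

Comparing the ratios on `P` with those on `P_hat` shows how the ranking of center sets can be preserved even when the raw cost differences are not small, which is the intuition behind moving from cost differences to cost ratios.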
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: pdf
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5866