A note on the $k$-means clustering for missing data

TMLR Paper4627 Authors

07 Apr 2025 (modified: 22 Apr 2025) · Under review for TMLR · CC BY 4.0
Abstract: The classical $k$-means clustering algorithm requires complete data and cannot be applied directly when observations contain missing entries. An intuitive and computationally efficient extension, considered in several studies, addresses this issue by minimizing the $k$-means loss over the observed entries only. This method is known as $k$-POD clustering. In this paper, we provide a theoretical analysis of this approach and demonstrate that it is generally inconsistent, even under the missing completely at random (MCAR) assumption. Specifically, we show that the expected loss minimized by the method differs from the original $k$-means objective, so the estimated cluster centers remain biased in the large-sample limit. This highlights a fundamental limitation: the method may fail to recover the true underlying cluster structure, even in settings where $k$-means performs well on fully observed data. Nevertheless, when the missing rate per variable is sufficiently low and the dimensionality is high, the method can still produce stable and practically useful results, making it a viable alternative when complete-case analysis is ineffective.
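To make the objective described in the abstract concrete, the following is a minimal sketch of a $k$-means loss restricted to observed entries, together with a simple alternating minimization of it. It is an illustration under stated assumptions, not the authors' implementation and not necessarily the procedure used in the $k$-POD literature: missing entries are assumed to be encoded as `NaN`, every variable is assumed to have at least one observed value, and the names `observed_kmeans_loss` and `kmeans_observed` are hypothetical.

```python
import numpy as np


def observed_kmeans_loss(X, centers, labels):
    """Squared distance to the assigned center, summed over observed entries only."""
    mask = ~np.isnan(X)
    # Missing entries contribute zero to the loss.
    diff = np.where(mask, X - centers[labels], 0.0)
    return float((diff ** 2).sum())


def kmeans_observed(X, k, n_iter=50, seed=0):
    """Alternating minimization of the observed-entries loss (illustrative sketch).

    Assumes missing values are NaN and each column has at least one observed value.
    Returns the centers, the labels, and the loss after each iteration.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mask = ~np.isnan(X)
    X0 = np.where(mask, X, 0.0)  # zero-filled copy; zeros are always masked out

    # Initialise from k random rows, filling their missing entries with column means.
    col_means = np.nanmean(X, axis=0)
    rows = rng.choice(n, size=k, replace=False)
    centers = np.where(mask[rows], X0[rows], col_means)

    history = []
    for _ in range(n_iter):
        # Assignment step: nearest center, measured on observed coordinates only.
        diff = (X0[:, None, :] - centers[None, :, :]) * mask[:, None, :]
        labels = (diff ** 2).sum(axis=2).argmin(axis=1)

        # Update step: per-cluster, per-coordinate mean of the observed entries.
        onehot = np.eye(k)[labels]       # shape (n, k)
        sums = onehot.T @ X0             # shape (k, d)
        counts = onehot.T @ mask         # observed entries per cluster and coordinate
        # Keep the previous value where a cluster has no observed entry in a coordinate.
        centers = np.where(counts > 0, sums / np.maximum(counts, 1), centers)

        history.append(observed_kmeans_loss(X, centers, labels))
    return centers, labels, history
```

Each step can only lower or preserve the observed-entries loss, so `history` is non-increasing. That guarantees convergence of this particular loss, but says nothing about whether its minimizer matches the complete-data $k$-means solution, which is the question the paper addresses.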
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Lei_Wang13
Submission Number: 4627