Avoiding Inferior Clusterings with Misspecified Gaussian Mixture Models

TMLR Paper 155 Authors

05 Jun 2022 (modified: 28 Feb 2023). Rejected by TMLR.
Abstract: The Gaussian Mixture Model (GMM) is a widely used probabilistic model for clustering. In many practical settings the true data distribution is unknown; it may be non-Gaussian and may be contaminated by noise or outliers. Clustering can still be performed with a misspecified GMM, but this may lead to incorrect classification of the underlying subpopulations. In this paper, we examine the performance of both Expectation Maximization (EM) and Gradient Descent (GD) on unconstrained Gaussian Mixture Models in the presence of misspecification. Our simulation study reveals a previously unreported class of \textit{inferior} clustering solutions, distinct from spurious solutions, that arises from asymmetry in the fitted component variances. We theoretically analyze this asymmetry and its relation to misspecification. To address the problem, we design a new functional penalty term for the likelihood based on the Kullback-Leibler divergence between pairs of fitted components. Closed-form expressions for the gradients of this penalized likelihood are difficult to derive, but GD can be performed effortlessly using automatic differentiation. The use of this penalty term leads to effective model selection and clustering with misspecified GMMs, as demonstrated through theoretical analysis and numerical experiments on synthetic and real datasets.
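To make the approach concrete, the following is a minimal sketch (not the authors' code) of a KL-penalized GMM likelihood optimized by gradient descent with automatic differentiation in JAX. The abstract does not give the exact functional form of the penalty, so the symmetrized pairwise-KL sum, its weight `lam`, the univariate setting, and all names (`kl_gauss`, `penalized_loss`, `grad_fn`) are illustrative assumptions.

```python
# Sketch of a KL-penalized GMM likelihood optimized by plain gradient descent.
# The penalty form below (symmetrized pairwise KL, weight `lam`) is an assumption;
# the paper's exact penalty may differ.
import jax
import jax.numpy as jnp

def kl_gauss(mu1, var1, mu2, var2):
    # Closed-form KL( N(mu1, var1) || N(mu2, var2) ) for univariate Gaussians.
    return 0.5 * (jnp.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

def neg_loglik(params, x):
    # Negative GMM log-likelihood under an unconstrained parameterization:
    # mixing weights via softmax, variances via exp, so GD needs no constraints.
    logits, mu, logvar = params
    w = jax.nn.softmax(logits)
    var = jnp.exp(logvar)
    # log N(x_n | mu_k, var_k), shape (N, K)
    logp = -0.5 * (jnp.log(2.0 * jnp.pi * var) + (x[:, None] - mu) ** 2 / var)
    return -jnp.sum(jax.scipy.special.logsumexp(jnp.log(w) + logp, axis=1))

def penalized_loss(params, x, lam=1.0):
    # Negative log-likelihood plus an assumed KL-based penalty over component
    # pairs; highly asymmetric variance pairs produce large KL and are penalized.
    _, mu, logvar = params
    var = jnp.exp(logvar)
    K = mu.shape[0]
    pen = sum(kl_gauss(mu[i], var[i], mu[j], var[j]) +
              kl_gauss(mu[j], var[j], mu[i], var[i])
              for i in range(K) for j in range(i + 1, K))
    return neg_loglik(params, x) + lam * pen

# Autodiff supplies the gradient; no closed-form derivation is needed.
grad_fn = jax.jit(jax.grad(penalized_loss))

# One gradient-descent step on toy data with K = 3 components.
key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (500,))
params = (jnp.zeros(3), jnp.array([-1.0, 0.0, 1.0]), jnp.zeros(3))
grads = grad_fn(params, x)
params = jax.tree_util.tree_map(lambda p, g: p - 1e-3 * g, params, grads)
```

The unconstrained parameterization (softmax logits for the weights, log-variances for the scales) is what makes plain GD valid without projection steps, which is presumably why the abstract emphasizes unconstrained GMMs.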
Submission Length: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=iBAPX0sMHe
Changes Since Last Submission: Fixed the font in the title and section headings to match the TMLR template.
Assigned Action Editor: ~Cedric_Archambeau1
Submission Number: 155