Model-Based Clustering and Variable Selection for Multivariate Count Data

Julien JACQUES; Thomas Brendan Murphy

Published: 12 May 2025, Last Modified: 12 May 2025. Accepted by Computo. License: CC BY-SA 4.0

Abstract: Model-based clustering provides a principled way of developing clustering methods. We develop a new model-based clustering method for count data that combines clustering with variable selection to improve clustering performance. The method is based on conditionally independent Poisson mixture models and Poisson generalized linear models. It is demonstrated on simulated data and on data from an ultra running race, where it yields excellent clustering and variable selection performance.
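As a rough illustration of the clustering component named in the abstract, the sketch below fits a conditionally independent Poisson mixture, in which each cluster has one Poisson rate per variable and the variables are independent given the cluster. It rests on several assumptions: the abstract does not say how the model is estimated, so a plain EM algorithm is used here; the function name `fit_poisson_mixture` and the toy counts are invented for illustration and are not taken from the paper or its R package; and the variable selection step based on Poisson generalized linear models is not covered.

```python
import math
import random


def poisson_logpmf(x, lam):
    """Log of the Poisson probability mass function at count x with rate lam."""
    return x * math.log(lam) - lam - math.lgamma(x + 1)


def fit_poisson_mixture(data, n_clusters, n_iter=100, seed=0):
    """EM for a mixture whose components are products of independent Poissons.

    Hypothetical helper for illustration only. Returns the mixing weights,
    the per-cluster rates, hard labels, and the log-likelihood evaluated
    in the last E-step.
    """
    rng = random.Random(seed)
    n, p = len(data), len(data[0])
    # Initialise rates from randomly chosen rows; the 0.5 offset avoids log(0).
    centers = rng.sample(range(n), n_clusters)
    rates = [[data[i][j] + 0.5 for j in range(p)] for i in centers]
    weights = [1.0 / n_clusters] * n_clusters
    resp, loglik = [], 0.0
    for _ in range(n_iter):
        # E-step: posterior cluster probabilities, computed in log space.
        resp, loglik = [], 0.0
        for row in data:
            logs = [
                math.log(weights[k])
                + sum(poisson_logpmf(x, rates[k][j]) for j, x in enumerate(row))
                for k in range(n_clusters)
            ]
            top = max(logs)
            unnorm = [math.exp(v - top) for v in logs]
            total = sum(unnorm)
            resp.append([u / total for u in unnorm])
            loglik += top + math.log(total)
        # M-step: weighted proportions and weighted means of the counts.
        # The small floors guard against log(0) if a cluster empties.
        for k in range(n_clusters):
            nk = max(sum(r[k] for r in resp), 1e-10)
            weights[k] = max(nk / n, 1e-10)
            for j in range(p):
                weighted_sum = sum(r[k] * row[j] for r, row in zip(resp, data))
                rates[k][j] = max(weighted_sum / nk, 1e-10)
    labels = [max(range(n_clusters), key=lambda k: r[k]) for r in resp]
    return weights, rates, labels, loglik


# Made-up toy counts: two well-separated groups of four observations each.
data = [
    [2, 1, 30], [1, 3, 28], [3, 2, 33], [2, 2, 29],
    [20, 25, 3], [22, 19, 2], [18, 24, 4], [21, 22, 1],
]
weights, rates, labels, loglik = fit_poisson_mixture(data, n_clusters=2)
print(labels)
print(rates)
```

Which group receives which label depends on the random initialisation, so only the grouping itself is meaningful. Working in log space in the E-step keeps the responsibilities numerically stable when the counts are large.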

Repository Url: https://jujacques.github.io/MultivariateCountData/

Changes Since Last Submission: The paper has been revised according to the reviewers' remarks. Point-by-point answers are given as comments on the reviews. The main changes are summarized here:
- we now explain more clearly that a unified Conditionally Independent Poisson Mixture Model is considered, and that the two models M1 and M2 are used only for variable selection,
- in the variable selection step, the dependence between the clustering variables, the proposal variables and the other variables is better explained,
- the choice of the number of clusters has been clarified,
- the parameters used for the simulations are now presented, and the presentation of the simulation results has been reorganized,
- some extra plots have been added to make the results easier to understand, both in the simulation study and in the real data analysis,
- an R package has been built and is now available, along with all the code used for the paper, in the GitHub repository.
We also added pseudo-code to better explain the stepwise variable selection process.

Assigned Action Editor: ~Pierre_Neuvial1

Submission Number: 14
