Global-Local Dirichlet Processes for Clustering Grouped Data in the Presence of Group-Specific Idiosyncratic Variables
Abstract: We consider the problem of clustering grouped data for which the observations may include group-specific variables in addition to the variables that are shared across groups. This type of data is quite common; for example, in cancer genomic studies, molecular information is available for all cancers whereas cancer-specific clinical information may only be available for certain cancers. Existing grouped clustering methods only consider the shared variables but ignore valuable information from the group-specific variables. To allow for these group-specific variables to aid in the clustering, we propose a novel Bayesian nonparametric approach, termed global-local (GLocal) Dirichlet process, that models the "global-local" structure of the observations across groups. We characterize the GLocal Dirichlet process using the stick-breaking representation and the representation as a limit of a finite mixture model. We theoretically quantify the approximation errors of the truncated prior, the corresponding finite mixture model, and the associated posterior distribution. We develop a fast variational Bayes algorithm for scalable posterior inference, which we illustrate with extensive simulations and a TCGA pan-gastrointestinal cancer dataset.
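Illustration: the truncated stick-breaking idea mentioned in the abstract can be sketched for a standard (single-level) Dirichlet process. This is not the GLocal construction itself, whose details are not given here; the function names, the concentration parameter `alpha`, and the truncation level `T` are illustrative assumptions. The tail-mass formula is the well-known expectation for a standard Dirichlet process, not one of the paper's error bounds.

```python
import numpy as np


def truncated_stick_breaking(alpha, T, rng):
    """Mixture weights of a standard DP truncated at T atoms.

    Assumption: plain DP stick-breaking, not the GLocal process.
    """
    # Break fractions v_k ~ Beta(1, alpha)
    v = rng.beta(1.0, alpha, size=T)
    # Close the last stick so the T weights sum to one
    v[-1] = 1.0
    # Stick length remaining before each break: prod_{j<k} (1 - v_j)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining


def expected_tail_mass(alpha, T):
    """Expected mass left beyond the first T atoms of an untruncated DP."""
    return (alpha / (1.0 + alpha)) ** T


rng = np.random.default_rng(0)
weights = truncated_stick_breaking(alpha=2.0, T=20, rng=rng)
tail = expected_tail_mass(alpha=2.0, T=20)
```

Under these assumptions, a larger `T` shrinks the expected discarded mass geometrically, which is the intuition behind quantifying the error of a truncated prior.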
Lay Summary: In many real-world studies, data are collected from several groups that share some common information, while each group may also come with its own unique information. For example, in cancer research, we might have genetic data shared across all cancer types, while certain clinical details might only be available for specific types of cancer.
Current methods that group or "cluster" such data usually only look at the shared or global information and ignore the group-specific or local features, which can lead to less accurate results. To address this, we introduce a new statistical method called the global-local (GLocal) Dirichlet process, which can handle both shared and group-specific information. This method is based on a flexible Bayesian approach that does not require fixing the number of clusters in advance. It captures the structure of the data at both the global (shared) and local (group-specific) levels.
In our paper, we explain how this method works under the hood, show how it can be approximated for practical purposes, and provide mathematical guarantees for the accuracy of the approximation. Finally, we develop a fast algorithm to apply our proposed method to large datasets. Using simulations and a real cancer dataset from a large public database, we show that our proposed method and algorithm work well.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/Arhit-Chakrabarti/GLocalVI
Primary Area: General Machine Learning->Clustering
Keywords: Bayesian nonparametrics, clustering, group-specific variables, approximation error, variational Bayes
Submission Number: 7348