Automated Exploratory Clustering

Published: 01 Jan 2024, Last Modified: 01 Apr 2025, Venue: IEEE Big Data 2024, License: CC BY-SA 4.0
Abstract: Clustering is a frequently encountered task in big data analytics, where the goal is to group similar objects together while separating dissimilar ones. Clustering is also well known to be highly subjective, in the sense that which clustering is best depends strongly on the application setting. Although the recently established research direction of Automated Clustering has produced various algorithmic solutions to the clustering problem, these approaches assume a predefined clustering evaluation metric to be optimized. They thus inherently assume that a single best clustering exists, which is not always true in real applications, where insight into the data comes from inspecting the resulting clusterings. To maximize the insight for a data scientist or a domain specialist, we propose not to investigate only a single best clustering but instead to explore multiple best clusterings according to different evaluation criteria. This not only helps to identify several clusters of interest to the user, but also maximizes the impact gained from following different evaluation criteria. In this paper, we hence propose the concept of Automated Exploratory Clustering, which follows the idea of automatically providing the best clusterings for further exploration. To this end, we formalize the problem of Automated Exploratory Clustering and define a theoretical framework comprising the necessary formulations. In addition, we propose an efficient algorithm to compute the most interesting clusterings and benchmark its effectiveness and efficiency. Our approach helps domain experts without expertise in clustering to gain new insights into their datasets and serves as a baseline for future research.