The Skyline Operator to Find the Needle in the Haystack for Automated Clustering

Georg Stefan Schlake, Christian Beecks

Published: 01 Jan 2024, Last Modified: 01 Apr 2025, Venue: IEEE Big Data 2024, License: CC BY-SA 4.0

Abstract: The analysis of big datasets is a challenging task. While many data scientists work in the field of supervised data analysis, there is also a growing demand for unsupervised data analysis, such as clustering. To address this demand, multiple AutoML approaches for clustering have been proposed. However, most of these approaches try to find the "best" clustering, ignoring the subjective nature of the clustering task. A domain expert, however, might be able to identify an appropriate clustering for their application from a small set of generated clusterings, even if they are not capable of creating these clusterings themselves. To enable domain experts to identify valuable clusterings without becoming experts in clustering as well, we propose to generate multiple clusterings via AutoML processes and to return a selection of clusterings from which the user can select the most preferred one. We investigate the use of the Skyline Operator in this use case to prune clusterings that are likely useless and to find a number of clusterings that are usable for domain experts. We investigate how many clusterings can be pruned this way and how many valuable clusterings are falsely pruned. Our empirical investigation is carried out on a number of synthetic datasets, where a known ground truth can serve as a proxy for the wishes of a domain expert and multiple properties of the clusterings can be known beforehand.
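
To illustrate the pruning idea, the following is a minimal sketch of a skyline (Pareto-front) filter over candidate clusterings. It is not the authors' implementation: the candidate names, the two score dimensions, and the score values are hypothetical, and it assumes every score is oriented so that higher is better. A clustering is pruned only if some other clustering is at least as good on every score and strictly better on at least one.

```python
from typing import Dict, List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if a is at least as good as b on every score
    and strictly better on at least one (higher is better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def skyline(scores: Dict[str, Sequence[float]]) -> List[str]:
    """Return the names of all candidates that no other candidate dominates."""
    return [
        name
        for name, s in scores.items()
        if not any(dominates(t, s) for other, t in scores.items() if other != name)
    ]


# Hypothetical candidates: each tuple holds two made-up quality scores,
# for example two internal validity indices rescaled so higher is better.
candidates = {
    "kmeans_k3": (0.62, 0.80),
    "kmeans_k5": (0.55, 0.85),
    "dbscan_a": (0.50, 0.70),
    "agglo_k4": (0.62, 0.75),
}

result = skyline(candidates)
print(result)  # ['kmeans_k3', 'kmeans_k5']
```

In this toy example, `dbscan_a` and `agglo_k4` are pruned because `kmeans_k3` is at least as good on both scores, while `kmeans_k3` and `kmeans_k5` both survive because each is better on a different score. The surviving set is what would be shown to the domain expert for a final choice.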