Clustering is an essential task in data analysis where one seeks to partition the observations of a dataset into $K$ clusters. Due to its ill-posed nature, the design of a clustering algorithm requires hypotheses about what defines good clusters. Different hypotheses may lead to different clusters. In other words, cluster definition and methodology must be adapted to the context in which they are applied~\citep{hennig_what_2015}.

Evaluating the quality of a clustering model is a complex problem that requires appropriate metrics. In an experimental setting, synthetic data can be generated according to hypotheses about the definition of clusters, which makes it possible to verify that a clustering algorithm recovers the expected partition. This verification can be done using \emph{external} metrics such as the (unsupervised) accuracy, the normalised mutual information (NMI), or the adjusted Rand index (ARI, \citealp{hubert_comparing_1985}). Conversely, in an exploratory context, \ie when no labels are available, we rely on \emph{internal} metrics that depend solely on the data observations and the model predictions, \eg the variance-ratio criterion~\citep{calinski_dendrite_1974}, the silhouette score~\citep{rousseeuw_silhouettes_1987}, or the integrated complete likelihood~\citep{biernacki_assessing_2000}. Internal metrics are often built around a specific clustering hypothesis and should therefore be paired with algorithms that share that hypothesis.

Despite the large number of clustering metrics~\citep{desgraupes_clustering_2013, charrad_nbclust_2014}, few are suitable for comparing clustering models built on different clustering hypotheses. Moreover, to the best of our knowledge, no clustering metric can integrate constraints when targets are partially observed.

To compare different clustering algorithms independently of their clustering hypotheses, we take inspiration from consensus clustering~\citep{strehl_cluster_2002} and rank clustering algorithms according to their proximity to a consensus matrix. Our underlying hypothesis is that a diverse set of clustering algorithms will shed light on clusters whose observations are more frequently connected.
Our contributions are:

\begin{itemize}
    \item A fast, simple-to-compute score for ranking clustering algorithms, based on consensus clustering and compatible with pairwise-constraint regularisation.
    \item The first exploration of clustering ensembles as a means of performing model selection, both for our metric and some baselines.
    \item An extensive benchmark on synthetic and real data against several internal metrics, in which our metric performs strongly.
\end{itemize}
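
One way to make the consensus idea above concrete is sketched below. The notation is introduced here only for illustration and is an assumption; it may differ from the formal definitions of the score. Suppose $M$ clustering models produce label vectors $y^{(1)}, \dots, y^{(M)}$ over $n$ observations. The connectivity matrix of model $m$ and the resulting consensus matrix can be written as
\[
    A^{(m)}_{ij} = \mathbf{1}\left[ y^{(m)}_i = y^{(m)}_j \right],
    \qquad
    C_{ij} = \frac{1}{M} \sum_{m=1}^{M} A^{(m)}_{ij},
    \qquad 1 \leq i, j \leq n,
\]
so that $C_{ij} \in [0, 1]$ is the fraction of models that place observations $i$ and $j$ in the same cluster. A purely illustrative proximity, which is not necessarily the one we propose, is the negative squared Frobenius distance to the consensus,
\[
    s_m = - \sum_{i=1}^{n} \sum_{j=1}^{n} \left( A^{(m)}_{ij} - C_{ij} \right)^2 ,
\]
under which models are ranked by decreasing $s_m$: a model whose partition agrees with the pairs that most models connect obtains a score closer to zero.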