Evaluating the quality of a clustering model is a challenging task. Because there is no formal definition of what a cluster is, there is no agreed definition of clustering quality either, and it is difficult to find an objective measure that allows different algorithms to be compared~\citep{boley_partitioning-based_1999}. For example, \citet{han_10_2012} observe that ``\textit{[s]ome methods measure how well the clusters fit the data set, while others measure how well the clusters match the ground truth}''. These two categories are referred to as internal metrics and external metrics, respectively.

\subsection{External indices}

In the presence of ground truth, \ie targets, we can evaluate clustering models using external indices. Common evaluation metrics include the (unsupervised) accuracy, the NMI~\citep{strehl_cluster_2002} or the ARI~\citep{hubert_comparing_1985}. It is worth noting that NMI and ARI are preferable to unsupervised accuracy when the number of clusters in a model differs from the number of clusters in the targets, the latter being generally unknown in practice.
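
As an illustration, these indices can be written as follows; the notation is ours and is introduced only for this sketch. Let $y$ denote the targets and $\hat{y}$ the cluster assignments of $n$ samples. The unsupervised accuracy is usually computed under the best one-to-one mapping $m$ between clusters and classes, which is why it becomes ill-defined when the two partitions have different numbers of groups:
\[
\mathrm{ACC}(y, \hat{y}) = \max_{m} \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\left[ y_i = m(\hat{y}_i) \right].
\]
With $I$ the mutual information and $H$ the entropy, the NMI with geometric-mean normalisation reads
\[
\mathrm{NMI}(y, \hat{y}) = \frac{I(y; \hat{y})}{\sqrt{H(y)\, H(\hat{y})}}.
\]
The ARI can be expressed with pair counts. Let $a$ be the number of sample pairs grouped together in both partitions, $d$ the number of pairs separated in both, and $b$ and $c$ the numbers of pairs grouped together in only one of the two partitions. Then
\[
\mathrm{ARI}(y, \hat{y}) = \frac{2\,(ad - bc)}{(a + b)(b + d) + (a + c)(c + d)}.
\]
Neither the NMI nor the ARI requires a mapping between clusters and classes.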

When we partially observe the targets, we can still use them as constraints, as in semi-supervised clustering~\citep{bair_semi-supervised_2013, cai_review_2023}. Two types of constraints can be distinguished: labels, and must-link or cannot-link constraints. The former explicitly assign samples to a cluster, whereas the latter only indicate whether two samples should be in the same cluster or in different clusters, regardless of the cluster membership. Labels imply must-link and cannot-link constraints, but the reverse is not true. In this setting, we can evaluate the model using, for example, the pairwise recall, precision and F-measure~\citep{basu_active_2004}, or the constrained Rand index (CRI, \citealp{klein_instance-level_2002}). For both measures, the evaluation is restricted to the set of samples (or set of sample pairs) that are not affected by constraints. These metrics are external and require ground truth. Consequently, they cannot be used if we do not have access to labels other than those used to constrain the clustering algorithm. In the absence of such information, the approximate measure of informativeness~\citep{davidson_measuring_2006} could be preferred: it is simply the fraction of constraints not satisfied by a clustering algorithm.
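
A minimal sketch of these quantities, in notation that we introduce here for illustration only: let $\mathcal{P}$ be the set of sample pairs not affected by constraints, $S_{\hat{y}} \subseteq \mathcal{P}$ the pairs placed in the same cluster by the model, and $S_{y} \subseteq \mathcal{P}$ the pairs that share the same target. The pairwise precision, recall and F-measure can then be written as
\[
\mathrm{P} = \frac{|S_{\hat{y}} \cap S_{y}|}{|S_{\hat{y}}|}, \qquad
\mathrm{R} = \frac{|S_{\hat{y}} \cap S_{y}|}{|S_{y}|}, \qquad
\mathrm{F} = \frac{2\,\mathrm{P}\,\mathrm{R}}{\mathrm{P} + \mathrm{R}}.
\]
Similarly, for a constraint set $\mathcal{C}$ and a partition $\hat{y}$ returned by the clustering algorithm, the approximate informativeness can be sketched as
\[
\frac{1}{|\mathcal{C}|} \sum_{c \in \mathcal{C}} \mathrm{unsat}(c, \hat{y}),
\]
where $\mathrm{unsat}(c, \hat{y})$ equals $1$ if the constraint $c$ is violated by $\hat{y}$ and $0$ otherwise.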

\subsection{Internal indices}

We can distinguish two types of internal indices: those that integrate the clustering hypotheses from the model, and those that carry their own hypotheses about the definition of what makes good clusters.

If a model defines a tractable likelihood, we can use this value to reflect the fit between the model and the data. For example, the Akaike information criterion (AIC, \citealp{akaike_new_1974, akaike_information_1998}) penalises the likelihood by the complexity of the model, expressed in terms of the number of free parameters. The Bayesian information criterion (BIC, \citealp{schwarz_estimating_1978}) weights this parameter penalty by the logarithm of the number of training samples. The integrated complete likelihood (ICL, \citealp{biernacki_assessing_2000}) extends the BIC to model-based clustering by distinguishing between model components and their correspondence to clusters using an entropy penalty term.
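
For reference, these criteria are commonly written as follows under the smaller-is-better convention; the symbols are our own choice for this sketch. With $\hat{\ell}$ the maximised log-likelihood, $k$ the number of free parameters and $n$ the number of training samples,
\[
\mathrm{AIC} = 2k - 2\hat{\ell}, \qquad
\mathrm{BIC} = k \ln n - 2\hat{\ell}.
\]
A frequently used approximation of the ICL adds to the BIC the entropy of the posterior membership probabilities $t_{ic}$ of sample $i$ for component $c$:
\[
\mathrm{ICL} \approx \mathrm{BIC} - 2 \sum_{i=1}^{n} \sum_{c} t_{ic} \ln t_{ic}.
\]
Since $t_{ic} \ln t_{ic} \leq 0$, the additional term is non-negative and grows when components overlap, so that poorly separated components are penalised.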

If a model does not define a tractable likelihood, we may not have access to a fitness measure from the model and have to construct it post hoc. This is notably the case for discriminative clustering models, \eg KMeans or DBSCAN~\citep{ester_density-based_1996}. In this sense, the most well-known internal metric is the within-group sum of squares (WGSS, \citealp{edwards_method_1965}), also known as the KMeans score. This score is well suited to clusters that are assumed to be concentrated around a centroid. However, ensuring that samples are concentrated around a centroid is not enough; a clear separation between clusters is also a desirable property. From this desire come criteria such as the variance-ratio criterion~\citep{calinski_dendrite_1974}, the Dunn index~\citep{dunn_well-separated_1974} and its generalisations~\citep{bezdek_new_1998}, the Davies-Bouldin index~\citep{davies_cluster_1979}, the Silhouette score~\citep{rousseeuw_silhouettes_1987} or the PBM index~\citep{pakhira_validity_2004}. Internal metrics comparing the coherence of pairwise clustering have also been proposed. For instance, the Gamma index~\citep{baker_measuring_1975} and the G+ index~\citep{rohlf_methods_1974} are based on the notion of discordant and concordant pairs. A pair of samples from the same cluster is concordant with another pair of samples from different clusters when the distance within the first pair is shorter than the distance within the second pair. The two pairs are discordant if, on the contrary, the distance is greater for the pair from the same cluster than for the pair from different clusters. To alleviate the requirement on the choice of distances, some of these scores were adapted for connectivity matrices~\citep{saha_connectivity_2012} derived from relative neighbourhood graphs~\citep{toussaint_relative_1980}.
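
To make two of these scores concrete, we give a short sketch in notation of our own. For clusters $C_1, \dots, C_K$ with centroids $\mu_1, \dots, \mu_K$, the WGSS is
\[
\mathrm{WGSS} = \sum_{k=1}^{K} \sum_{i \in C_k} \lVert x_i - \mu_k \rVert^2 .
\]
For the Gamma index, every within-cluster pair is compared with every between-cluster pair. Let $s^{+}$ be the number of comparisons in which the within-cluster distance is strictly smaller than the between-cluster distance (concordant), and $s^{-}$ the number of comparisons in which it is strictly greater (discordant). The index is then
\[
\Gamma = \frac{s^{+} - s^{-}}{s^{+} + s^{-}},
\]
which lies in $[-1, 1]$ and equals $1$ when every within-cluster distance is smaller than every between-cluster distance.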

\subsection{Ranking models in consensus clustering}

If we restrict the goal of a clustering metric to comparing models, then the most relevant property is the ability of a metric to \emph{rank} algorithms well. In a ranking context, we necessarily have several models: this allows us to use ensemble methods. Consensus clustering is an unsupervised ensemble method that stems from classification ensembles~\citep{strehl_cluster_2002}. The goal is to use several clustering algorithms, called base clusterings, and to combine their results into a single final clustering using a consensus function. Combining the results is expected to increase the quality of the clustering, in the sense of an evaluation using external labels.

Several works then developed filtering criteria on the base clusterings to improve the quality of consensus clustering. This field, sometimes called \emph{ensemble clustering selection}~\citep{golalipour_clustering_2021}, focuses on selecting a subset of base clusterings based on the belief that some of the base models hinder the global quality and should be discarded. The goal is to keep high-quality clusterings while maintaining some diversity~\citep{kuncheva_using_2004, hadjitodorov_moderate_2006, fern_cluster_2008}. This selection can be done by keeping the base clusterings that are closest to the consensus result~\citep{hong_resampling-based_2009, azimi_adaptive_2009, jia_bagging-based_2011}. Although this introduces an ordering between models, this ordering is non-deterministic as it relies on the outcome of the consensus, which can be stochastic. Selection can also be achieved by solving a K-vertex subgraph problem on a graph whose vertices are the base clusterings and whose edges are weighted by the similarities between pairs of base clusterings~\citep{fern_cluster_2008, yang_cluster_2017}. However, such an approach does not introduce an order between models.

In some cases, this selection is made through a ranking. Often, this ranking is obtained by interpolating between the quality and the diversity of each clustering algorithm~\citep{fern_cluster_2008, naldi_cluster_2013, wang_two-level-oriented_2018}. The ranking then depends simultaneously on the definitions of the quality and of the diversity of a base clustering, in an internal metric sense, and on the interpolation coefficient. Thus, metrics for ranking clustering algorithms are not new, \emph{but their purpose is different}. To the best of our knowledge, ranking clustering algorithms through consensus has never been used as a metric for selecting clustering models.