For our experiments, we moved incrementally from synthetic partitions to datasets and then to constraint integration. We first show that the DISCOTEC and other ensemble baselines perform on par on synthetic cases. We then introduce datasets and test clustering algorithms with either a fixed or an unrestricted number of clusters. We highlight that the DISCOTEC performs strongly in the latter setting. Finally, we show that constraints can enhance the performance of the DISCOTEC on real datasets, even when few constraints are provided.

\subsection{General protocol}

To evaluate the DISCOTEC, we borrowed the methodology of \citet[section 4]{vendramin_relative_2010}. We first select a pool of $N_\mathcal{D}$ datasets, and for each dataset we apply $T$ clustering algorithms. We then evaluate the correlation between an internal metric of interest and an external metric that describes how well a model matches some targets. A higher correlation value indicates that the ranking proposed by the internal metric is effective at identifying the most relevant clustering. We report the average correlations over the $N_\mathcal{D}$ datasets. Note that we have negated all scores that should be minimised, so that a positive correlation always means good performance. We chose to report the Kendall's tau correlation~\citep{kendall_treatment_1945} in the paper because it measures how well two rankings agree. For extended results, including the Pearson correlation as originally proposed by \citet{vendramin_relative_2010}, see Appendix~\ref{app:extended_benchmark}.
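
As an illustration of the reported quantity, and in notation introduced here only for this purpose, let $C_d$ and $D_d$ denote the numbers of concordant and discordant pairs of models between the internal and external rankings on dataset $d$. In the tie-free case, Kendall's tau and its average over datasets can be written as
\[
\tau_d = \frac{C_d - D_d}{T(T-1)/2}, \qquad \bar{\tau} = \frac{1}{N_\mathcal{D}} \sum_{d=1}^{N_\mathcal{D}} \tau_d .
\]
This is only a sketch: when ties occur, the tie-aware variant adjusts the denominator, and the exact variant used in the experiments is not specified here.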

We distinguish two types of baselines: internal metrics that are also based on ensemble clustering and internal metrics that evaluate models individually using distances between observations. For the former, we use the ANMI and AARI, and emphasise that this is the first time the ranking properties of these metrics are studied. For the latter, we used clustering metrics available in the permetrics Python library~\citep{thieu_permetrics_2024} and implemented some ourselves from the clusterCrit package~\citep{desgraupes_clustering_2013}. For the sake of clarity, we have restricted all figures and tables to the ensemble metrics and the top-performing distance-based metrics where relevant. Extended tables with the 20 baselines can be found in Appendix~\ref{app:extended_benchmark}. Code can be found at: \url{https://github.com/oshillou/Discotec}.

\subsection{Synthetic partitions}


\input{figures/fig_synthetic_partitions_correlations}
\input{figures/fig_synthetic_pivot}


We started by evaluating the DISCOTEC with synthetic partitions, which allowed us to control the difficulty of the consensus. We first generated a ground truth of $n$ observations and $K$ clusters, then generated $T$ different partitions that imitate the ground truth with some controlled accuracy. To that end, we sample for each observation a conservation indicator with some probability $\rho$. If the observation is conserved, it keeps the same cluster as in the ground truth. Otherwise, it is assigned to a cluster different from that of the ground truth.
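
One possible formalisation of this sampling procedure is sketched below, in notation introduced purely for illustration: $y_i$ is the ground-truth cluster of observation $i$, $c_i^t$ its conservation indicator in partition $t$, and $\hat{y}_i^t$ its sampled cluster. The uniform choice among the $K-1$ remaining clusters is an assumption, since the description only requires a different cluster.
\[
c_i^t \sim \mathrm{Bernoulli}(\rho), \qquad
\hat{y}_i^t = y_i \ \text{ if } c_i^t = 1, \qquad
\hat{y}_i^t \sim \mathcal{U}\left(\{1, \dots, K\} \setminus \{y_i\}\right) \ \text{ if } c_i^t = 0 .
\]
Under this reading, the expected fraction of observations that keep their ground-truth cluster in a partition is exactly $\rho$.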

We tested two synthetic scenarios: one with a uniform distribution of accuracies with respect to the ground truth and one with unbalanced accuracies. For the first scenario, we uniformly sample a conservation probability $\rho^t \in [0.1, \rho_\text{max}]$ for each model. This ensures that the models have an expected accuracy of at least 10\% and at most $\rho_\text{max}$. For the second scenario, we first sample two partitions called \emph{hubs}: one with $\rho=0.2$ and one with $\rho=0.9$. Then, we sample a fraction $\alpha T$ of the models with an accuracy in the range $[0.2, 0.9]$ with respect to the first hub, and the remaining $(1-\alpha)T$ models with the same accuracy range with respect to the second hub.

Since neither of these scenarios has underlying data samples on which we can measure distances, we can only evaluate the AARI, the ANMI and the DISCOTEC. We ran both scenarios with $n=200$ samples and $K=10$ clusters. In the first scenario, we varied $T$ from 5 to 50 models. In the second scenario, we fixed $T=50$. Each simulation was repeated 50 times. The results of the first scenario are shown in Figure~\ref{fig:synthetic_partitions_correlation} and those of the second scenario in Figure~\ref{fig:synthetic_partitions_pivot}. For the sake of readability, we only report the DISCOTEC with KL divergence and with binarised consensus in the figures, as the rankings using total variation distance and squared Hellinger distance followed the KL curve perfectly.

From the first scenario, we observe that increasing the maximum possible accuracy $\rho_\text{max}$ increases the correlation of the ranking. Indeed, when the maximum accuracy is low, most of the synthetic partitions tend to disagree with each other, resulting in a noisy consensus. Consequently, no pattern can emerge from the consensus matrix, and the DISCOTEC fails to identify the correct clusterings. In contrast, if the maximum accuracy is high, a pattern can be seen in the consensus matrix, and the ranking can be coherent with this pattern. For completeness, we have included examples of such matrices in Appendix~\ref{app:consensus_visualisation}. We can note in Figure~\ref{fig:synthetic_partitions_correlation} that the number of models is crucial to improve the performance of both the baselines and the DISCOTEC. Indeed, as the number of models grows from 5 to 50, the correlation between the ARI of the partitions on the targets and the ranking of each method increases, and its standard deviation decreases. This effect is even stronger for the binarised DISCOTEC. We further discuss and experiment with scaling within this scenario in Appendix~\ref{app:scaling}.

The success of the first scenario is due to the uniform distribution of the accuracies of all sampled models, and does not transfer to the second scenario. The second scenario highlights that both the DISCOTEC and the baselines are attracted to dominant hubs of clustering solutions. Indeed, we can see in Figure~\ref{fig:synthetic_partitions_pivot} that when most partitions are close to a solution with high accuracy, \ie $\alpha\approx 0$, the ranking has a high correlation with the ARI. Conversely, increasing the number of models that are similar to a poor solution with very low accuracy, \ie $\alpha\approx 1$, decreases the correlation for the same reason of noisy patterns as described above.

In summary, we have shown with these synthetic scenarios that ranking according to the relationships between models, both in the baselines and the DISCOTEC, depends on two main factors: (i) the number of models and (ii) the distribution of the clustering ARIs. The number of models should preferably be large enough. However, too large a number of models is detrimental to the AARI and ANMI, which scale quadratically with the number of models while the DISCOTEC scales linearly. Regarding the distribution of the clustering ARIs, we can expect better performance if it is more concentrated on solutions that are close enough to the ground truth and uniform. In other words: the diversity of base clusterings matters, in the sense of different cluster definitions.

\subsection{Synthetic and real datasets clustering}

To simulate more complex distributions of ARI with respect to targets, we now turn to different combinations of clustering models and datasets. We considered two different categories of datasets for our experiment: the fundamental clustering problem suite (FCPS, \citealp{thrun_fundamental_2021}) and real datasets from the UCI repository, summarised in Appendix~\ref{app:benchmark_details}. The FCPS consists of different simulated datasets in two or three dimensions, so that the definition of clusters is consensual to the naked eye. In contrast, the UCI datasets are intended for classification, which means that the classes and their number may not reflect the clusters and their number. Therefore, we must be careful in our interpretation of the ARI depending on the category of the datasets.

\subsubsection{Restriction to a fixed number of clusters}

\input{figures/table_benchmark_fixed_kendall_restricted}
\input{figures/table_benchmark_fixed_regretari_restricted}

Similarly to the synthetic scenarios, we restrict our experiments to clustering models that must find as many clusters as the number of clusters (resp. classes) indicated by the targets of the FCPS (resp. UCI) datasets. We try two different algorithms: KMeans and agglomerative clustering. We run KMeans 50 times. Since agglomerative clustering deterministically produces the same clustering, we vary its parameters using single, average, complete, and Ward linkage, and also Euclidean or Manhattan distance. This results in 7 models, because the Manhattan distance and the Ward linkage are incompatible.

Following the general protocol, we report the correlation in Table~\ref{tab:benchmark_fixed_kendall_restricted}. Since the average correlation can be high due to some lucky runs, we extend our results by also reporting the regret score on the ARI of the top-ranked model for all methods. We define the regret score as the difference between the best performance of all methods and the performance of one method, which we average over all datasets. A lower regret score is better, and a regret score of 0 indicates that the method always had the best performance. We report the regret scores in Table~\ref{tab:benchmark_fixed_regretari_restricted}. Regret scores on the correlation can be found in Appendix~\ref{app:extended_benchmark}.
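
To make this definition concrete, one way to write the regret score is sketched below. The notation is introduced only for illustration: $\mathcal{M}$ is the set of compared ranking methods and $a_d(m)$ is the ARI of the model ranked first by method $m$ on dataset $d$.
\[
\mathrm{regret}(m) = \frac{1}{N_\mathcal{D}} \sum_{d=1}^{N_\mathcal{D}} \left( \max_{m' \in \mathcal{M}} a_d(m') - a_d(m) \right).
\]
Each term of the sum is non-negative, so the regret is zero only if method $m$ attains the best performance among all methods on every dataset.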

These results complement the observations made previously in our second synthetic scenario. Indeed, we can see in Table~\ref{tab:benchmark_fixed_regretari_restricted} that the clusterings proposed by the agglomerative algorithms are more diverse than those of KMeans, since the ARI regret reaches up to 38\% for some scores. This diversity leads to higher correlations than for KMeans in Table~\ref{tab:benchmark_fixed_kendall_restricted}. In contrast, the KMeans runs were attracted to specific clusterings that had a low ARI with respect to the targets for some datasets, leading to lower correlations. Furthermore, owing to the lack of diversity between the base clusterings, the regret on the selected model ARI is similar for all scores.

Among the compared baselines, no score stands out as offering a better ranking than the others for this experiment. We only mention the silhouette index (SI, \citealp{rousseeuw_silhouettes_1987}) as an example, and note the strong success of the Calinski-Harabasz index (CHI, \citealp{calinski_dendrite_1974}) for the UCI datasets with agglomerative clustering, highlighted by a high correlation in Table~\ref{tab:benchmark_fixed_kendall_restricted} and a regret score of 0 on the selected model ARI in Table~\ref{tab:benchmark_fixed_regretari_restricted}.


\subsubsection{Unrestricted pool of clustering models}
\input{figures/table_benchmark_mixed_kendall_restricted}
\input{figures/table_benchmark_mixed_regretari_restricted}

We now extend the previous experiments by proposing a more diverse pool of clustering algorithms. We run KMeans clustering with $K$ varying from 2 to 20 for each dataset, 5 times per value of $K$. We run agglomerative clustering with the same linkage parameters as before with Euclidean distances for 2 to 20 clusters. We then add DBSCAN models with parameter epsilon varying from the 1\% quantile of the Euclidean distances of the dataset to the 25\% quantile. We discard degenerate clusterings. Finally, we also evaluate the performance of the ranking methods when we merge all the models.


We observe in Table~\ref{tab:benchmark_mixed_kendall_restricted} that the average correlation is the highest for the DISCOTEC with binarised consensus matrix. Moreover, the ARI regret of the selected model is also the lowest in Table~\ref{tab:benchmark_mixed_regretari_restricted}, which reveals better selection. In contrast, the DISCOTEC using the KL divergence did not perform better than the AARI baselines.

The performance of the DISCOTEC with KL divergence suffered precisely from the overclustering bias that motivated the introduction of the binarised consensus. The high proportion of models with a large number of clusters contributed to lowering the values of the consensus matrix, bringing them all close to 0. The KL ranking consequently favoured models with the largest number of clusters because they are less penalised when connecting as few observations as possible.

Finally, there is a notable difference between the FCPS and UCI datasets. For the former, we are certain that at least one of the proposed algorithms in the merged pool will achieve an ARI of 1, because it matches the empirical definition of the clusters in the dataset. In contrast, the latter does not guarantee that the targets reflect clusters. Therefore, it is likely that the set of clustering algorithms will point to different clusters than the targets, sometimes with a different number of clusters compared to the number of classes. This accounts for the lower correlation values for the UCI datasets compared to the FCPS datasets in Table~\ref{tab:benchmark_mixed_kendall_restricted}.

\subsection{Constraint integration}
\input{figures/fig_regularisation}

To make the most sense of the correlation between targets and clusters in the UCI datasets, we add constraints to the ranking. Thus, we simultaneously look for a model that captures clusters that correspond to what most models find, while also respecting the classes as much as possible.

We measured constraint satisfaction using the approximate measure of informativeness $\mathcal{R}$, and added it to both the DISCOTEC and the AARI/ANMI baselines. We chose not to add it to distance-based metrics because they do not correspond to the approximate measure of informativeness in terms of dimensional analysis. Moreover, the constraint regularisation is bounded in $[0,1]$, whereas some other metrics are unbounded.


To assess the benefit of constraints, we report for each dataset the initial unconstrained correlation between targets and rankings using the same models from the previous experiment, and report the correlation after adding constraints. For each dataset, we randomly selected $n$ observations and generated all must-link and cannot-link constraints they implied using the targets, and then evaluated the correlation of the regularised rankings. We repeated the constraint addition 50 times. Correlations before and after constraint addition can be found in Figure~\ref{fig:benchmark_correlation_gain}.

We observe that the addition of constraints is rarely detrimental to the correlation, as highlighted by the standard deviation decreasing from 0 constrained observations to 5 constrained observations. Moreover, increasing the number of constraints increases the correlation of the ranking with the targets. However, this increase is more substantial when introducing the first few constraints and tends to flatten afterwards. Nonetheless, we may note that 50 constrained observations is a relatively small number for some UCI datasets, \eg Segmentation with 2310 observations.
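
To put these numbers in perspective, the short calculation below assumes that every pair of selected observations yields exactly one constraint: a must-link if both share the same target, and a cannot-link otherwise. Under this assumption, $n$ selected observations imply
\[
\frac{n(n-1)}{2} \ \text{constraints}, \qquad \text{\eg } \frac{5 \times 4}{2} = 10 \ \text{ and } \ \frac{50 \times 49}{2} = 1225 .
\]
For a dataset of 2310 observations, the total number of pairs is $2310 \times 2309 / 2 = 2\,666\,895$, so 50 constrained observations cover roughly $0.05\%$ of all pairs.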

