We seek to build a score for ranking clustering algorithms that simultaneously takes into account the results of all algorithms and complies with must-link and cannot-link constraints. We start with the general unconstrained score, then detail how it can be simplified by binarising the consensus, and finish with the addition of constraints.

\subsection{The unconstrained score}

We assume that we have a set of $T\geq 3$ clustering models and a dataset of $n$ unlabelled samples $\mathcal{D} = \{x_i\}_{i=1}^n$. Each clustering model $t$ defines a partition of this dataset into $K^t$ clusters: $\pi^t \in \{1, \ldots, K^t\}^n$. Note that we only consider hard clusterings here, so that our method is compatible with any clustering algorithm, since soft clusterings can always be converted to hard ones.

For each partition, we construct its connectivity matrix, whose entries are binary values indicating whether two samples belong to the same cluster:
\begin{equation}
    \pmb{A}^t = \left[ \mathbbm{1}[\pi^t_i = \pi^t_j]\right].
\end{equation}
We can then build the consensus matrix \citep{monti_consensus_2003} by averaging all connectivity matrices:
\begin{equation}
    \pmb{C} = T^{-1} \sum_{t=1}^T \pmb{A}^t.
\end{equation}
The entries of the consensus matrix can be interpreted as parameters of Bernoulli distributions: each describes the probability that two samples end up in the same cluster according to the ensemble of models. The more often a pair of observations ends up in the same cluster, the higher its consensus value, and we would like to identify a clustering that respects this trend. Conversely, when the consensus value is close to zero, we would like to select a clustering that does not link the two observations.
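As a minimal sketch of the two equations above, the following pure-Python snippet builds the connectivity matrices of three toy partitions of four samples and averages them into a consensus matrix. The function names \texttt{connectivity} and \texttt{consensus}, as well as the toy labels, are illustrative assumptions and not part of the original method's code.

```python
def connectivity(labels):
    """Binary matrix A with A[i][j] = 1 iff samples i and j share a cluster."""
    n = len(labels)
    return [[int(labels[i] == labels[j]) for j in range(n)] for i in range(n)]


def consensus(partitions):
    """Entrywise average of the connectivity matrices of all partitions."""
    T = len(partitions)
    n = len(partitions[0])
    mats = [connectivity(p) for p in partitions]
    return [[sum(m[i][j] for m in mats) / T for j in range(n)] for i in range(n)]


# Toy ensemble (assumption): T = 3 hard partitions of n = 4 samples.
partitions = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 1, 1, 1],
]
C = consensus(partitions)

# Samples 0 and 1 are linked by two of the three models, samples 0 and 3 by none.
print(C[0][1], C[0][3])
```

Entries of \texttt{C} close to one mark pairs that most models link, and entries close to zero mark pairs that most models separate.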


To order the clustering algorithms, we propose to measure the distance between each model's connectivity matrix and the consensus matrix: the smaller the distance, the better. We expect the model with the lowest distance to yield the partition that best matches the consensus established by the ensemble. For an arbitrary distance or divergence $D$, \eg the total variation distance or the KL divergence, the score is:
\begin{equation}
    \mathcal{S}(\pi^t) = \sum_{i,j} D(\pmb{A}^t_{ij} \| \pmb{C}_{ij}).
\end{equation}
Note that some combinations of inputs are impossible when computing $D$. We cannot have 0 (resp. 1) for the connectivity $\pmb{A}^t_{ij}$ and 1 (resp. 0) for the consensus $\pmb{C}_{ij}$, because the consensus is an average that includes $\pmb{A}^t_{ij}$ itself. When both values are 0, or both are 1, the distance is necessarily 0. Therefore, the only distances we need to compute correspond to the cases where $\pmb{C}_{ij} \in (0,1)$. We summarise three examples of distances for this case, which we will use in the experiments, in Table~\ref{tab:explicit_distances}.
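To illustrate the score, the sketch below instantiates $D$ with the total variation distance between two Bernoulli distributions, which reduces to the absolute difference of their parameters, $|\pmb{A}^t_{ij} - \pmb{C}_{ij}|$. This choice of $D$, the helper names and the toy partitions are assumptions made for the example only.

```python
def connectivity(labels):
    """Binary matrix A with A[i][j] = 1 iff samples i and j share a cluster."""
    n = len(labels)
    return [[int(labels[i] == labels[j]) for j in range(n)] for i in range(n)]


def consensus(partitions):
    """Entrywise average of the connectivity matrices of all partitions."""
    T, n = len(partitions), len(partitions[0])
    mats = [connectivity(p) for p in partitions]
    return [[sum(m[i][j] for m in mats) / T for j in range(n)] for i in range(n)]


def tv_score(labels, C):
    """Sum over all pairs (i, j) of |A_ij - C_ij| (total variation, assumed D)."""
    A = connectivity(labels)
    n = len(labels)
    return sum(abs(A[i][j] - C[i][j]) for i in range(n) for j in range(n))


# Toy ensemble (assumption): T = 3 hard partitions of n = 4 samples.
partitions = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 1, 1, 1],
]
C = consensus(partitions)
scores = [tv_score(p, C) for p in partitions]

# Lower is better: indices of the models sorted from best to worst.
ranking = sorted(range(len(partitions)), key=lambda t: scores[t])
print(scores, ranking)
```

In this toy ensemble the first partition obtains the lowest score, since its connectivities deviate the least from the averaged ones.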

\input{figures/table_explicit_distances}

This score may favour solutions with too few or too many clusters. Indeed, when clustering models tend to connect most of the samples together through large clusters, the ranking favours solutions with few clusters because they minimise the number of terms $D(\mathcal{B}(0)\|\mathcal{B}(\pmb{C}_{ij}))$, which incur a large penalty. Conversely, when most clustering models have a large number of clusters, the consensus matrix may become sparse or filled with very low values, and the score favours solutions with many clusters because they minimise the number of terms $D(\mathcal{B}(1)\|\mathcal{B}(\pmb{C}_{ij}))$. When the numbers of clusters in the pool of clustering models span both extremes, the score still tends to favour solutions with many clusters because the consensus matrix takes low values.

\input{figures/algorithm_discotec}

To alleviate this limitation, which arises when the consensus matrix contains only high values or only low values, we propose to binarise it with respect to its mean:
\begin{equation}
\pmb{Q} = \left[\mathbbm{1}\left[\pmb{C}_{ij} \geq n^{-2} \sum_{i^\prime j^\prime} \pmb{C}_{i^\prime j^\prime}\right]\right].
\end{equation}
In this variant, the score only counts the absolute differences between the binary entries of the connectivity matrix and those of the binarised consensus matrix. While the binarised consensus matrix is no longer compatible with the original perspective of statistical distances between two matrices, the resulting score can be interpreted as the ratio of mismatching connectivities between observations: the lower, the better.
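The binarised variant can be sketched as follows. The snippet thresholds the consensus matrix at its mean and reports, for each partition, the fraction of entries where the connectivity and the binarised consensus disagree. Dividing by $n^2$ is our reading of the word ``ratio'' above, and the function names and toy data are assumptions.

```python
def connectivity(labels):
    """Binary matrix A with A[i][j] = 1 iff samples i and j share a cluster."""
    n = len(labels)
    return [[int(labels[i] == labels[j]) for j in range(n)] for i in range(n)]


def consensus(partitions):
    """Entrywise average of the connectivity matrices of all partitions."""
    T, n = len(partitions), len(partitions[0])
    mats = [connectivity(p) for p in partitions]
    return [[sum(m[i][j] for m in mats) / T for j in range(n)] for i in range(n)]


def binarise(C):
    """Q[i][j] = 1 iff C[i][j] is at least the mean of all entries of C."""
    n = len(C)
    mean = sum(sum(row) for row in C) / (n * n)
    return [[int(c >= mean) for c in row] for row in C]


def mismatch_ratio(labels, Q):
    """Fraction of pairs whose connectivity differs from Q (assumed n^2 scaling)."""
    A = connectivity(labels)
    n = len(labels)
    mismatches = sum(A[i][j] != Q[i][j] for i in range(n) for j in range(n))
    return mismatches / (n * n)


# Toy ensemble (assumption): T = 3 hard partitions of n = 4 samples.
partitions = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 1, 1, 1],
]
Q = binarise(consensus(partitions))
scores = [mismatch_ratio(p, Q) for p in partitions]
print(scores)
```

Here the mean consensus is $7/12$, so only the pairs linked by at least two of the three models are kept as ones in \texttt{Q}.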

\subsection{Adding regularisations}

An important feature of the proposed score is its compatibility with the approximate measure of informativeness~\citep{davidson_measuring_2006}, \ie the average number of violated constraints. Given a set of $n_\text{ML}$ must-link constraints $\mathcal{C}_\text{ML} = \{(a_i,b_i)\}_{i=1}^{n_\text{ML}}$ and a set of $n_\text{CL}$ cannot-link constraints $\mathcal{C}_\text{CL} = \{(a_i,b_i)\}_{i=1}^{n_\text{CL}}$, this regularisation is:
\begin{equation}
\mathcal{R}(\pi^t) = \frac{\sum_{(a,b) \in \mathcal{C}_\text{ML}} D(\pmb{A}_{ab}^t\| 1) +\sum_{(a,b) \in \mathcal{C}_\text{CL}} D(\pmb{A}_{ab}^t\| 0)}{n_\text{ML}+n_\text{CL}}.
\end{equation}
Both the regularisation and our score are contained in $[0,1]$ and correspond to sums of distances between a connectivity and a target value. Thus, both measures are compatible according to dimensional analysis.
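As a hedged illustration of the regularisation, assume that $D$ is the absolute difference, so that each term equals one when a constraint is violated and zero otherwise. The regularisation then reduces to the fraction of violated constraints. The function name \texttt{violation\_rate} and the toy constraints below are hypothetical.

```python
def violation_rate(labels, must_link, cannot_link):
    """Fraction of violated constraints for one hard partition.

    A must-link pair (a, b) is violated when a and b are in different clusters;
    a cannot-link pair is violated when they are in the same cluster.
    """
    violated = sum(labels[a] != labels[b] for a, b in must_link)
    violated += sum(labels[a] == labels[b] for a, b in cannot_link)
    return violated / (len(must_link) + len(cannot_link))


# Toy partition and constraints (assumptions for the example).
labels = [0, 0, 1, 1]
must_link = [(0, 1), (1, 2)]      # (1, 2) is violated
cannot_link = [(0, 3), (2, 3)]    # (2, 3) is violated

rate = violation_rate(labels, must_link, cannot_link)
print(rate)
```

Each constraint only requires a lookup of two labels, which is consistent with the cost of the regularisation growing linearly with $n_\text{ML}+n_\text{CL}$ per model.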

We summarise the binarised version of the discriminative ordering through ensemble consensus (DISCOTEC) in Algorithm~\ref{alg:binary_discotec_algorithm}. The computational complexity of this algorithm is $\mathcal{O}(T(n^2+n_\text{ML}+n_\text{CL}))$ for $T$ models and $n$ observations.

Note that DISCOTEC scales linearly with the number of models. In comparison, the average NMI (ANMI, \citealp{strehl_cluster_2002}) and the average ARI (AARI), which were used for clustering ensemble selection~\citep{fern_cluster_2008}, scale quadratically with $T$. These metrics consist of the average of a score between a partition and all other partitions, \eg for ANMI:
\begin{equation}
    \text{ANMI}(\pi^t) = \frac{1}{T-1} \sum_{t^\prime\neq t} \text{NMI}(\pi^t, \pi^{t^\prime}).
\end{equation}
Consequently, evaluating the AARI or ANMI for all $T$ models requires $T(T-1)/2$ pairwise computations, which becomes expensive when the number of models is large.
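As an illustrative order of magnitude, take $T=100$ models (an arbitrary value chosen for this example, not one from the experiments) and count only the number of model-level comparisons:
\begin{equation*}
    \underbrace{\frac{T(T-1)}{2}}_{\text{ANMI or AARI}} = \frac{100 \times 99}{2} = 4950
    \qquad \text{versus} \qquad
    \underbrace{T}_{\text{DISCOTEC}} = 100.
\end{equation*}
This count ignores the cost of each individual comparison, which differs between the two approaches.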