In this experiment, we further explore the relationship between the number of models and the quality of the ranking. We keep the synthetic scenario from our first experiment in Section~\ref{sec:experiment}, where a ground truth is first generated and then $T$ models are created by preserving between 10\% and $\rho_\text{max}$ of the labels. The resulting models have, on average, an accuracy between 10\% and $\rho_\text{max}$. For three specific thresholds $\rho_\text{max}\in \{0.2, 0.5, 0.9\}$, which correspond to a decreasing difficulty of consensus, we increase the number of models $T$ from 5 to 200. We report the correlations averaged over 50 runs for each value of $T$ and $\rho_\text{max}$. Figure~\ref{fig:scaling_pearson} shows the Pearson correlation and Figure~\ref{fig:scaling_kendall} shows Kendall's tau correlation coefficient. For clarity, we omit the squared Hellinger and total variation distances from both figures because their curves perfectly follow the KL curve.
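
To make the reported quantity explicit, one possible way to write it is given below. The symbols $R$, $s_t^{(r)}$, $a_t^{(r)}$ and $c$ are notation introduced here for illustration only. For run $r$, let $s_t^{(r)}$ denote the score by which model $t$ is ranked and $a_t^{(r)}$ its ARI with the targets. Under this reading, each point of the curves would be
\begin{equation*}
    \bar{c}(T, \rho_\text{max}) = \frac{1}{R} \sum_{r=1}^{R} c\big( (s_t^{(r)})_{t=1}^{T}, (a_t^{(r)})_{t=1}^{T} \big), \qquad R = 50,
\end{equation*}
where $c$ is either the Pearson correlation or Kendall's tau. In its simplest form, assuming no ties, the latter is
\begin{equation*}
    \tau = \frac{2\,(C - D)}{T(T-1)},
\end{equation*}
where $C$ and $D$ are the numbers of concordant and discordant pairs among the $T(T-1)/2$ pairs of models. The sign convention depends on whether a higher or a lower score denotes a better rank.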

\input{figures/fig_scaling_pearson}
\input{figures/fig_scaling_kendall}

We observe from both figures that the scaling depends on the difficulty of reaching a consensus. When a consensus is hard to find, \ie $\rho_\text{max}=0.2$, even 200 models are insufficient to establish a strong correlation between the ranking and the ARI with targets. In contrast, an easy scenario, \ie $\rho_\text{max}=0.9$, requires few models to achieve excellent correlations, as they are already close to 1 with 20 models. In the intermediate scenario, \ie $\rho_\text{max}=0.5$, increasing the number of models steadily increases the correlations. Notably, even in this intermediate scenario, the binary DISCOTEC displays stronger performance than the DISCOTEC based on the KL distance.

We conclude that when the consensus is not clear-cut, adding models to the ensemble may be beneficial to the DISCOTEC ranking.