Quantifying Uncertainty in Error Consistency: Towards Reliable Behavioral Comparison of Classifiers

Published: 18 Sept 2025, Last Modified: 29 Oct 2025NeurIPS 2025 posterEveryoneRevisionsBibTeXCC BY 4.0
Keywords: Error Consistency, Behavioral Alignment, Vision
TL;DR: We present a method for calculating confidence intervals around error consistency values and revisit earlier work using this metric, showing that many differences found between DNN models of vision are statistically insignificant.
Abstract: Benchmarking models is a key factor for the rapid progress in machine learning (ML) research. Thus, further progress depends on improving benchmarking metrics. A standard metric to measure the behavioral alignment between ML models and human observers is error consistency (EC). EC allows for more fine-grained comparisons of behavior than other metrics such as accuracy, and has been used in the influential Brain-Score benchmark to rank different DNNs by their behavioral consistency with humans. Previously, EC values have been reported without confidence intervals. However, empirically measured EC values are typically noisy - thus, without confidence intervals, valid benchmarking conclusions are problematic. Here we improve on standard EC in two ways: First, we show how to obtain confidence intervals for EC using a bootstrapping technique, allowing us to derive significance tests for EC. Second, we propose a new computational model relating the EC between two classifiers to the implicit probability that one of them copies responses from the other. This view of EC allows us to give practical guidance to scientists regarding the number of trials required for sufficiently powerful, conclusive experiments. Finally, we use our methodology to revisit popular NeuroAI-results. We find that while the general trend of behavioral differences between humans and machines holds up to scrutiny, many reported differences between deep vision models are statistically insignificant. Our methodology enables researchers to design adequately powered experiments that can reliably detect behavioral differences between models, providing a foundation for more rigorous benchmarking of behavioral alignment.
Supplementary Material: zip
Primary Area: Neuroscience and cognitive science (e.g., neural coding, brain-computer interfaces)
Submission Number: 27539
Loading