How Many Raters Do You Need? Power Analysis for Foundation Models

NeurIPS 2023 Workshop ICBINB, Submission 11

Published: 27 Oct 2023, Last Modified: 01 Dec 2023
Keywords: foundation models, evaluation, response variance
TL;DR: We propose a method for significance testing the performance of foundation models.
Abstract: Foundation models (large machine learning models) are poorly suited to conventional machine learning evaluation, both because their behavior is highly stochastic and because the tasks they perform are complex. Conventional evaluation typically assumes that model behavior is deterministic and simple enough to be measured against gold-standard data with unitary, authoritative, "correct" answers, using straightforward metrics such as accuracy, precision, and recall. In this work, we propose an evaluation framework suitable for foundation models that accounts for variance in the responses of both the machine model and the human raters. Using recent advances in p-value estimation, we investigate the trade-offs among the number of items in a test set, the number of responses per item, the sampling method, and the metric when measuring the comparative difference between two hypothetical foundation models at various degrees of similarity. When two models are far apart in predictive performance, fewer raters are needed to compare them confidently, as expected. As the models draw closer, however, we find that ensuring the power analysis correctly reflects the difference in performance requires more annotators than are currently typical in annotation collection.
Submission Number: 11
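The trade-off the abstract describes, that closer models require more raters and items to detect a significant difference, can be illustrated with a simulation. The sketch below is not the paper's method; it is a minimal Monte Carlo power analysis assuming binary correctness judgments drawn as Bernoulli samples and a permutation test on the difference of per-item mean scores. All function and parameter names (`simulated_power`, `p_a`, `p_b`, `n_raters`, etc.) are hypothetical.

```python
import numpy as np

def simulated_power(p_a, p_b, n_items, n_raters, n_sims=200,
                    n_perms=200, alpha=0.05, seed=0):
    """Monte Carlo estimate of the power to distinguish two models.

    Each of `n_items` test items receives `n_raters` binary
    correctness judgments per model, drawn Bernoulli(p_a) or
    Bernoulli(p_b). A permutation test on the absolute difference
    of mean per-item scores yields a p-value per simulated
    experiment; power is the fraction of p-values below `alpha`.
    """
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sims):
        # Per-item mean score across raters for each model.
        a = rng.binomial(1, p_a, size=(n_items, n_raters)).mean(axis=1)
        b = rng.binomial(1, p_b, size=(n_items, n_raters)).mean(axis=1)
        observed = abs(a.mean() - b.mean())
        pooled = np.concatenate([a, b])
        count = 0
        for _ in range(n_perms):
            # Shuffle item scores across the two models and recompute
            # the statistic under the null of no model difference.
            rng.shuffle(pooled)
            diff = abs(pooled[:n_items].mean() - pooled[n_items:].mean())
            if diff >= observed:
                count += 1
        p_value = (count + 1) / (n_perms + 1)  # add-one permutation p-value
        if p_value < alpha:
            rejections += 1
    return rejections / n_sims
```

Under these assumptions, a large gap (say 0.9 vs. 0.5 accuracy) yields near-certain detection with a few dozen items and three raters, while a small gap (0.71 vs. 0.70) leaves power near the false-positive rate at the same budget, so confidently separating close models requires raising `n_items`, `n_raters`, or both.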