Keywords: llm, evaluation, ppi, inference, efficient, active, chatbot arena, prediction-powered inference, statistical inference
TL;DR: This work develops a theoretical framework for cost-optimal model evaluation when choosing among model rating options with different cost vs. performance tradeoffs.
Abstract: The development lifecycle of generative AI systems requires continual evaluation, data acquisition, and annotation, which is costly in both resources and time. In practice, a desire for rapid iteration often makes it necessary to rely on synthetic annotation data because of its low cost, despite the potential for substantial bias. In this paper, we develop a rigorous theoretical framework for novel, cost-aware evaluation pipelines that actively balance the use of a cheap, but often inaccurate, weak rater---such as a model-based autorater that is designed to automatically assess the quality of generated content---with a more expensive, but also more accurate, strong rater such as a human annotator. Building on recent work in active and prediction-powered statistical inference, we theoretically derive a family of cost-optimal policies for allocating a given annotation budget between weak and strong raters so as to maximize statistical efficiency.
Next, using synthetic and real-world data, we empirically characterize the conditions under which such policies can yield significant improvements over classical methods. Finally, we find that practical approximations of the theoretically optimal policies can achieve the same estimation precision at a far lower total annotation budget than standard evaluation methods, especially in tasks where example difficulty is highly variable.
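For readers unfamiliar with prediction-powered inference (PPI), the sketch below illustrates the basic idea the abstract builds on: average a cheap weak rater over a large unlabeled pool, then correct its bias on a small subset also scored by an expensive strong rater. This is a minimal, assumption-laden illustration, not the cost-optimal allocation policy derived in the paper; the function name `ppi_mean_estimate`, the synthetic data, and the noiseless strong rater are all hypothetical choices made here for clarity.

```python
import numpy as np

def ppi_mean_estimate(y_strong, weak_on_labeled, weak_on_unlabeled):
    """Prediction-powered estimate of a mean rating.

    y_strong:          strong-rater labels on a small labeled set
    weak_on_labeled:   weak-rater scores on that same labeled set
    weak_on_unlabeled: weak-rater scores on a large, disjoint unlabeled set
    """
    # Start from the cheap weak-rater average over the large unlabeled pool...
    weak_mean = np.mean(weak_on_unlabeled)
    # ...then correct its bias using the small set scored by both raters.
    rectifier = np.mean(np.asarray(y_strong) - np.asarray(weak_on_labeled))
    return weak_mean + rectifier

# Illustrative synthetic data: a weak rater that systematically over-scores.
rng = np.random.default_rng(0)
n = 10_000
true_quality = rng.normal(0.6, 0.2, size=n)                 # strong rater (noiseless here)
weak_scores = true_quality + 0.1 + rng.normal(0, 0.05, n)   # biased weak rater

# Spend the strong-rater budget on a small random subset.
labeled_idx = rng.choice(n, size=200, replace=False)
unlabeled_mask = np.ones(n, dtype=bool)
unlabeled_mask[labeled_idx] = False

estimate = ppi_mean_estimate(
    y_strong=true_quality[labeled_idx],
    weak_on_labeled=weak_scores[labeled_idx],
    weak_on_unlabeled=weak_scores[unlabeled_mask],
)
print(f"weak-only mean: {weak_scores.mean():.3f}, PPI estimate: {estimate:.3f}")
```

In this form the rectifier term keeps the estimate unbiased even when the weak rater is systematically off; the budgeting question the paper studies is then how many strong-rater labels to purchase, and on which examples, to shrink the estimator's variance at the lowest cost.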
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 21747