Keywords: large language models, evaluations, Gaussian approximation, Monte-Carlo simulation, efficient, rapid
TL;DR: Methods for efficiently and rapidly estimating the majority-voting accuracy of LLMs.
Abstract: Test-time scaling methods, such as voting, have emerged as a powerful paradigm to dramatically improve the performance of large language models (LLMs). When using majority voting, it is often useful to estimate the tradeoff between task performance (e.g., accuracy) and computational cost as we vary the size of the ensemble used in voting, denoted $M$, or as we vary hyperparameters, such as temperature, in pursuit of a more favorable tradeoff. In the literature, voting accuracy is evaluated using a purely empirical approach that requires many LLM evaluations and is highly computationally intensive. In this work, we propose two methods to estimate the voting accuracy of an LLM with substantially less computational cost than current methods. Using a popular public benchmark dataset of LLM problems (MATH), we demonstrate that our two estimation approaches closely approximate the true ensemble accuracy with substantially less computation than a purely empirical approach, especially as the number of votes grows larger.
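The sketch below is purely illustrative of the general idea of cheaply estimating $M$-vote majority accuracy, not the paper's actual estimators. It assumes a simplified correct-vs-everything-else voting model and hypothetical per-question single-sample accuracies `p_hat` (e.g., measured from a handful of generations per problem); the function names and parameters are invented for illustration. Under that model the $M$-vote accuracy on a question with single-sample accuracy $p$ is $P[\mathrm{Binomial}(M, p) > M/2]$, which can be approximated with a Gaussian (normal) approximation or a small Monte-Carlo simulation instead of generating $M$ fresh LLM answers per question.

```python
# Illustrative sketch (not the paper's exact methods): estimate M-vote majority
# accuracy from assumed per-question single-sample accuracies, two ways.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def majority_acc_gaussian(p, M):
    """Normal approximation to mean_i P[Binomial(M, p_i) > M/2] (continuity-corrected)."""
    p = np.asarray(p, dtype=float)
    mean, std = M * p, np.sqrt(M * p * (1.0 - p))
    std = np.maximum(std, 1e-12)        # avoid division by zero when p is 0 or 1
    k = np.floor(M / 2) + 1             # minimum correct count for a strict majority
    z = (k - 0.5 - mean) / std          # continuity-corrected threshold
    return float(np.mean(1.0 - norm.cdf(z)))

def majority_acc_monte_carlo(p, M, n_sims=2000):
    """Monte-Carlo estimate: simulate M correct/incorrect draws per question."""
    p = np.asarray(p, dtype=float)
    correct_counts = rng.binomial(M, p[:, None], size=(p.size, n_sims))
    return float(np.mean(correct_counts > M / 2))

# Hypothetical per-question single-sample accuracies on a benchmark such as MATH.
p_hat = rng.uniform(0.3, 0.9, size=500)
for M in (1, 5, 15, 31):
    print(M, majority_acc_gaussian(p_hat, M), majority_acc_monte_carlo(p_hat, M))
```

In this toy setup, both estimates are computed from a fixed set of per-question probabilities, so the cost is independent of $M$, whereas a purely empirical evaluation would require $M$ LLM generations per question for every value of $M$ considered.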
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 13148