Keywords: large language models, evaluations, Gaussian approximation, Monte-Carlo simulation, efficient, rapid
TL;DR: Methods for efficiently and rapidly estimating the majority-voting accuracy of LLMs.
Abstract: Test-time scaling methods, such as voting, have emerged as a powerful paradigm to dramatically improve the performance of large language models (LLMs). When using majority voting, it is often useful to estimate the tradeoff between task performance (e.g., accuracy) and computational cost as we vary the size of the ensemble used in voting, denoted $M$, or as we vary hyperparameters, such as temperature, in pursuit of a more favorable tradeoff. In the literature, voting accuracy is evaluated using a purely empirical approach that requires many LLM evaluations and is highly computationally intensive. In this work, we propose two methods to estimate the voting accuracy of an LLM with substantially less computational cost than current methods. Using a popular public benchmark dataset of LLM problems (MATH), we demonstrate that our two estimation approaches closely approximate the true ensemble accuracy with substantially less computation than a purely empirical approach, especially as the number of votes grows larger.
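The sketch below is purely illustrative of the general idea of cheaply estimating $M$-vote majority accuracy, not the paper's actual estimators. It assumes a simplified correct-vs-everything-else voting model and hypothetical per-question single-sample accuracies `p_hat` (e.g., measured from a handful of generations per problem); the function names and parameters are invented for illustration. Under that model the $M$-vote accuracy on a question with single-sample accuracy $p$ is $P[\mathrm{Binomial}(M, p) > M/2]$, which can be approximated with a Gaussian (normal) approximation or a small Monte-Carlo simulation instead of generating $M$ fresh LLM answers per question.

```python
# Illustrative sketch (not the paper's exact methods): estimate M-vote majority
# accuracy from assumed per-question single-sample accuracies, two ways.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def majority_acc_gaussian(p, M):
    """Normal approximation to mean_i P[Binomial(M, p_i) > M/2] (continuity-corrected)."""
    p = np.asarray(p, dtype=float)
    mean, std = M * p, np.sqrt(M * p * (1.0 - p))
    std = np.maximum(std, 1e-12)        # avoid division by zero when p is 0 or 1
    k = np.floor(M / 2) + 1             # minimum correct count for a strict majority
    z = (k - 0.5 - mean) / std          # continuity-corrected threshold
    return float(np.mean(1.0 - norm.cdf(z)))

def majority_acc_monte_carlo(p, M, n_sims=2000):
    """Monte-Carlo estimate: simulate M correct/incorrect draws per question."""
    p = np.asarray(p, dtype=float)
    correct_counts = rng.binomial(M, p[:, None], size=(p.size, n_sims))
    return float(np.mean(correct_counts > M / 2))

# Hypothetical per-question single-sample accuracies on a benchmark such as MATH.
p_hat = rng.uniform(0.3, 0.9, size=500)
for M in (1, 5, 15, 31):
    print(M, majority_acc_gaussian(p_hat, M), majority_acc_monte_carlo(p_hat, M))
```

In this toy setup, both estimates are computed from a fixed set of per-question probabilities, so the cost is independent of $M$, whereas a purely empirical evaluation would require $M$ LLM generations per question for every value of $M$ considered.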
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 13148