Efficient Cost-Aware LLM Evaluation via Bayesian Bandit Gittins Indices

Published: 25 May 2026, Last Modified: 29 May 2026, Venue: DEMO 2026 Poster, License: CC BY 4.0
Keywords: LLM Evaluation, Bayesian Bandits, Gittins Indices
TL;DR: We formulate efficient LLM evaluation as a cost-aware Bayesian bandit problem and use Gittins indices to adaptively allocate benchmark queries under heterogeneous evaluation costs.
Abstract: Selecting a high-performing LLM configuration increasingly requires comparing many candidate models, prompts, or decoding settings under limited evaluation budgets. Exhaustively evaluating every candidate on every benchmark example can be expensive, and each additional query incurs costs such as API pricing, token usage, latency, or human grading effort. We formulate this configuration-selection task as *cost-aware Bayesian bandit evaluation* and propose a *Gittins index policy* that treats each configuration's benchmark performance as latent, updates posterior uncertainty, allocates queries according to their value of information, and provides a stopping signal. The method combines sample efficiency with computational efficiency: after precomputing Gittins indices for the Gaussian posterior dynamics, online allocation only requires table lookup and posterior updates, and it naturally handles varying evaluation costs. Across GSM8K, PIQA, and MMLU response matrices of different sizes, our Gittins index policies usually reach lower simple regret with fewer evaluations or lower cumulative cost than UCB-E, a frequentist best-arm identification method, and UCB-E-LRF, its low-rank correlation-aware extension.
Submission Number: 119
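
The allocation loop described in the abstract (posterior update, index-based selection, cost accounting, stopping) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the prior, the known noise variance, and the budget-exhaustion stopping rule are all hypothetical, and `standin_index` is a simple placeholder score, not a Gittins index. In the method described above, that step would instead be a lookup into the precomputed Gittins index table.

```python
import math
import random


def update_gaussian(mean, var, obs, noise_var):
    """Conjugate update of a N(mean, var) posterior after one noisy observation."""
    precision = 1.0 / var + 1.0 / noise_var
    new_var = 1.0 / precision
    new_mean = new_var * (mean / var + obs / noise_var)
    return new_mean, new_var


def standin_index(mean, var, cost):
    """Placeholder score (NOT a Gittins index): posterior mean plus an
    uncertainty bonus that is discounted for expensive configurations."""
    return mean + math.sqrt(var) / cost


def run_allocation(true_means, costs, budget, noise_var=0.25, seed=0):
    """Spend `budget` on evaluation queries, one configuration at a time.

    Assumptions: Gaussian observation noise with known variance, a shared
    N(0.5, 1.0) prior for every configuration, and stopping only when no
    configuration is affordable any more.
    """
    rng = random.Random(seed)
    k = len(true_means)
    means = [0.5] * k
    variances = [1.0] * k
    pulls = [0] * k
    spent = 0.0

    while True:
        affordable = [i for i in range(k) if spent + costs[i] <= budget]
        if not affordable:
            break
        # Query the affordable configuration with the highest index.
        arm = max(
            affordable,
            key=lambda i: standin_index(means[i], variances[i], costs[i]),
        )
        obs = rng.gauss(true_means[arm], math.sqrt(noise_var))
        means[arm], variances[arm] = update_gaussian(
            means[arm], variances[arm], obs, noise_var
        )
        spent += costs[arm]
        pulls[arm] += 1

    # Recommend the configuration with the highest posterior mean.
    best = max(range(k), key=lambda i: means[i])
    return best, means, variances, pulls, spent


if __name__ == "__main__":
    best, means, variances, pulls, spent = run_allocation(
        true_means=[0.55, 0.70, 0.60], costs=[1.0, 2.0, 1.0], budget=30.0
    )
    print("recommended configuration:", best)
    print("queries per configuration:", pulls, "total cost:", spent)
```

Keeping the index as a separate function mirrors the split the abstract describes: the online loop only performs posterior updates and an index evaluation, so a table-backed index could be swapped in without changing the rest of the loop.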