Keywords: adaptive testing, efficient evaluation, benchmarking, item response theory, model ranking, continuous metrics
Abstract: Computerized Adaptive Testing (CAT) has proven effective for efficient LLM
evaluation on multiple-choice benchmarks, but modern LLM evaluation
increasingly relies on generation tasks where outputs are scored continuously
rather than marked correct/incorrect. We present a principled extension of
item response theory (IRT)-based adaptive testing to continuous bounded
scores (ROUGE, BLEU, LLM-as-a-judge) by replacing the Bernoulli response
distribution with a heteroskedastic normal distribution. Building on this
model, we introduce an uncertainty-aware ranker with adaptive stopping
criteria that achieves reliable model rankings while administering as few
items as possible, minimizing evaluation cost. We validate our method
on five benchmarks spanning n-gram-based, embedding-based, and
LLM-as-a-judge metrics. Using only 2% of benchmark items, our method
improves ranking correlation (Kendall's $\tau$) by 0.12 over random
sampling and achieves 95% accuracy on its confident predictions.
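
For concreteness, here is a minimal sketch of the distributional swap the abstract describes, assuming a standard two-parameter logistic (2PL) IRT link; the exact link and variance parameterization below are illustrative assumptions, not taken from the paper:

```latex
% Binary CAT models correctness with a Bernoulli response; the proposed
% extension replaces it with a heteroskedastic normal over bounded scores:
\[
  y_{ij} \sim \mathrm{Bernoulli}(\mu_{ij})
  \;\longrightarrow\;
  y_{ij} \sim \mathcal{N}\!\big(\mu_{ij},\, s_j^2(\mu_{ij})\big),
  \qquad
  \mu_{ij} = \sigma\!\big(a_j(\theta_i - b_j)\big),
\]
% where \theta_i is model i's ability, a_j and b_j are item j's
% discrimination and difficulty, and s_j^2(\cdot) is a variance function
% allowed to depend on the mean, e.g. s_j^2(\mu) = \phi_j\,\mu(1-\mu)
% with an item-level dispersion \phi_j (an assumed form, for illustration).
```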
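Likewise, a hedged sketch of what the uncertainty-aware stopping check could look like, assuming the ranker maintains a Gaussian posterior over each model's ability; the separation criterion, the function name `ranking_resolved`, and the z threshold are all hypothetical, not the paper's exact rule:

```python
import numpy as np

def ranking_resolved(ability_mean, ability_var, z=1.96):
    """Hypothetical stopping check for uncertainty-aware ranking.

    Sort models by posterior mean ability and declare the ranking
    resolved once every adjacent pair is separated by more than z
    times the combined posterior standard deviation. Illustrative
    criterion only, not the paper's stopping rule.
    """
    order = np.argsort(-np.asarray(ability_mean))  # best model first
    for a, b in zip(order[:-1], order[1:]):
        gap = ability_mean[a] - ability_mean[b]
        se = np.sqrt(ability_var[a] + ability_var[b])
        if gap < z * se:            # this pair is still ambiguous
            return order, False     # keep administering items
    return order, True              # confident ranking reached

# Example: three models whose posteriors are already well separated.
means = np.array([1.2, 0.4, -0.3])
variances = np.array([0.01, 0.02, 0.01])
print(ranking_resolved(means, variances))  # (array([0, 1, 2]), True)
```

In a full adaptive loop, one would administer the most informative remaining item to the still-ambiguous models, update the ability posteriors under the normal response model, and re-check.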
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: evaluation, evaluation methodologies, benchmarking, statistical testing for evaluation, LLM efficiency, NLP in resource-constrained settings
Contribution Types: NLP engineering experiment, Approaches low compute settings-efficiency, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 10049