Estimating Pass@$k$ from Fewer Samples with Hierarchical Bayesian Priors

Published: 25 May 2026, Last Modified: 29 May 2026 | CTB@ICML 2026 | Everyone | CC BY 4.0
Keywords: Pass at k, code generation, evaluation, llm
TL;DR: Predicting Pass@k under a low evaluation budget, and predicting improvement from additional sampling, using a Bayesian hierarchical model
Abstract: Large Language Models are commonly evaluated on coding tasks using sampling-based metrics such as Pass@$k$, the probability of generating at least one correct solution in $k$ independent generations. Estimating Pass@$k$ curves from limited evaluation samples is important for benchmark design and stress testing, but can require many generations per task when per-sample success probabilities are small. We study this low-evaluation-budget regime using standard empirical-Bayes hierarchical priors over task-level success probabilities. The resulting posterior-predictive estimators pool information across tasks to estimate dataset-level Pass@$k$ curves and to diagnose when additional sampling is likely to help. We also study a Beta--Binomial improvability diagnostic, $\Delta\mathrm{Pass}$, whose interpretation is tied to the fitted-prior approximation. Across CodeContests, MBPP, and HumanEval, the experiments show complementary regimes: low-pass@1 tradeoffs, high-pass@1 Pareto frontiers, and a near-zero boundary-mass setting where explicit zero inflation is particularly informative.
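As a rough illustration of the quantities named in the abstract, the sketch below contrasts the widely used combinatorial Pass@$k$ estimator with a posterior-predictive estimate under a Beta prior on each task's success probability. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names and the prior parameters `a` and `b` are hypothetical, the prior is taken as given rather than fitted by empirical Bayes, and neither zero inflation nor the $\Delta\mathrm{Pass}$ diagnostic is modeled.

```python
import math


def pass_at_k_unbiased(n, c, k):
    """Combinatorial estimator 1 - C(n-c, k) / C(n, k) from c successes in n samples.

    Only defined for k <= n, which is why small budgets limit the reachable k.
    """
    if k > n:
        raise ValueError("the combinatorial estimator needs k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one success
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def pass_at_k_beta(n, c, k, a, b):
    """Posterior-predictive Pass@k under an assumed Beta(a, b) prior on p.

    The posterior is Beta(a + c, b + n - c), and
    E[(1 - p)^k] = prod_{j=0}^{k-1} (b' + j) / (a' + b' + j),
    so this is defined for any k, including k > n.
    """
    a_post = a + c
    b_post = b + (n - c)
    log_fail = sum(
        math.log(b_post + j) - math.log(a_post + b_post + j) for j in range(k)
    )
    return 1.0 - math.exp(log_fail)


def dataset_pass_at_k(counts, n, k, a, b):
    """Average the per-task posterior-predictive Pass@k over a dataset.

    counts holds the number of successes per task, each out of n samples.
    """
    return sum(pass_at_k_beta(n, c, k, a, b) for c in counts) / len(counts)
```

With a shared prior, a task that shows zero successes in a handful of samples still receives a nonzero Pass@$k$ estimate, which is the pooling effect the abstract describes; how strongly it is pulled depends entirely on the assumed `a` and `b`.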
Paper Type: Long (8 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 51