Keywords: evaluation, uncertainty quantification, statistics, Bayesian, LLM
Abstract: It is increasingly important to evaluate large language models (LLMs) in terms of their "behaviors," such as their tendency to produce toxic output or their sensitivity to adversarial prompts. Such evaluations often rely on a set of benchmark prompts, where the output for each prompt is scored in a binary fashion (e.g., refused/not refused or toxic/non-toxic), and the binary scores are aggregated to evaluate the LLM. We present two preliminary case studies applying a Bayesian treatment to such aggregated binary scores: 1) evaluating refusal rates on JailBreakBench, and 2) evaluating pairwise preferences of one LLM over another on MT-Bench, demonstrating how the Bayesian approach can provide uncertainty quantification of LLM behavior.
Submission Number: 141
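
For illustration, a minimal sketch of one standard Bayesian treatment of such binary scores, assuming a simple Beta-Binomial model with a Beta prior (the submission's actual model may differ): given k refusals observed on n benchmark prompts, the posterior over the refusal rate is Beta(k + a, n - k + b), from which a posterior mean and credible interval follow directly.

```python
# Sketch only (assumption): a Beta-Binomial posterior over a refusal rate.
# This is not necessarily the submission's model; it only illustrates how a
# Bayesian treatment of binary benchmark scores yields uncertainty estimates.
from scipy import stats

def refusal_rate_posterior(num_refusals: int, num_prompts: int,
                           prior_a: float = 1.0, prior_b: float = 1.0):
    """Posterior over the refusal rate under a Beta(prior_a, prior_b) prior."""
    return stats.beta(prior_a + num_refusals,
                      prior_b + (num_prompts - num_refusals))

# Hypothetical numbers: 37 refusals out of 100 JailBreakBench prompts.
posterior = refusal_rate_posterior(37, 100)
print("Posterior mean refusal rate:", posterior.mean())
print("95% credible interval:", posterior.interval(0.95))
```

The same construction applies to the MT-Bench case study if pairwise preferences are recorded as binary win/loss outcomes: the win count plays the role of the refusal count above.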