Keywords: evaluation, uncertainty quantification, statistics, Bayesian, LLM
Abstract: It is increasingly important to evaluate large language models (LLMs) in terms of their "behaviors," such as their tendency to produce toxic output or their sensitivity to adversarial prompts. Such evaluations often rely on a set of benchmark prompts, where the output for each prompt is scored in a binary fashion (e.g., refused/not refused or toxic/non-toxic), and the binary scores are aggregated to evaluate the LLM. We present two preliminary case studies applying a Bayesian treatment to such aggregated binary scores: 1) evaluating refusal rates on JailBreakBench, and 2) evaluating pairwise preferences of one LLM over another on MT-Bench, demonstrating how the Bayesian approach can provide uncertainty quantification of LLM behavior.
Submission Number: 141
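
For illustration, a minimal sketch of one standard Bayesian treatment of such binary scores, assuming a simple Beta-Binomial model with a Beta prior (the submission's actual model may differ): given k refusals observed on n benchmark prompts, the posterior over the refusal rate is Beta(k + a, n - k + b), from which a posterior mean and credible interval follow directly.

```python
# Sketch only (assumption): a Beta-Binomial posterior over a refusal rate.
# This is not necessarily the submission's model; it only illustrates how a
# Bayesian treatment of binary benchmark scores yields uncertainty estimates.
from scipy import stats

def refusal_rate_posterior(num_refusals: int, num_prompts: int,
                           prior_a: float = 1.0, prior_b: float = 1.0):
    """Posterior over the refusal rate under a Beta(prior_a, prior_b) prior."""
    return stats.beta(prior_a + num_refusals,
                      prior_b + (num_prompts - num_refusals))

# Hypothetical numbers: 37 refusals out of 100 JailBreakBench prompts.
posterior = refusal_rate_posterior(37, 100)
print("Posterior mean refusal rate:", posterior.mean())
print("95% credible interval:", posterior.interval(0.95))
```

The same construction applies to the MT-Bench case study if pairwise preferences are recorded as binary win/loss outcomes: the win count plays the role of the refusal count above.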