Bayesian Calibration of Win Rate Estimation with LLM Evaluators

ACL ARR 2024 June Submission5624 Authors

16 Jun 2024 (modified: 22 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: Recent advances in large language models (LLMs) show the potential of using LLMs as evaluators for assessing the quality of generations from LLMs. However, naively applying LLM evaluators to compare different systems can lead to unreliable results due to the inaccuracy and intrinsic bias of LLM evaluators. To mitigate this problem, we propose two calibration methods, Bayesian Win-Rate Sampling (BWRS) and Bayesian Dawid-Skene, both of which leverage Bayesian inference to more accurately infer the true win rate of generative language models. We empirically validate our methods on seven datasets spanning story generation, summarization, and instruction following. We show that both methods are effective in improving the accuracy of win rate estimation using LLMs as evaluators, offering a promising direction for reliable automatic text quality evaluation.
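The abstract describes placing a Bayesian posterior over a model's true win rate given evaluator verdicts. As a rough illustration of that general idea (not the paper's actual BWRS or Bayesian Dawid-Skene procedures, whose details are not given here), a minimal Beta-Binomial sketch might look like the following; the counts and prior are hypothetical:

```python
# Minimal sketch of Bayesian win-rate inference via a Beta-Binomial model.
# This is an illustrative assumption, not the paper's method: it ignores
# evaluator bias/accuracy and only shows how a posterior over a true win
# rate can be formed from pairwise-comparison outcomes.

def beta_posterior(wins: int, comparisons: int,
                   a: float = 1.0, b: float = 1.0) -> tuple[float, float]:
    """Return (alpha, beta) of the Beta posterior under a Beta(a, b) prior."""
    return a + wins, b + (comparisons - wins)

def posterior_mean(alpha: float, beta: float) -> float:
    """Posterior mean estimate of the win rate."""
    return alpha / (alpha + beta)

if __name__ == "__main__":
    # Hypothetical example: an LLM judge says system A beat system B
    # in 70 of 100 pairwise comparisons.
    alpha, beta = beta_posterior(wins=70, comparisons=100)
    print(posterior_mean(alpha, beta))  # 71/102, roughly 0.696
```

A full calibration method would additionally model the evaluator's error rates (as Dawid-Skene-style approaches do) rather than treating its verdicts as ground truth.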
Paper Type: Long
Research Area: Generation
Research Area Keywords: automatic evaluation, evaluation methodologies
Languages Studied: English
Submission Number: 5624