Addressing Pitfalls in the Evaluation of Uncertainty Estimation Methods for Natural Language Generation
Keywords: Uncertainty, Evaluation, Language Models
TL;DR: Investigating pitfalls in evaluating uncertainty estimation methods for NLG and addressing them.
Abstract: Hallucinations are a common issue that undermines the reliability of large language models (LLMs).
Recent studies have identified a specific subset of hallucinations, known as confabulations, which arise due to predictive uncertainty of LLMs.
To detect confabulations, various methods for estimating predictive uncertainty in natural language generation (NLG) have been developed.
These methods are typically evaluated by correlating uncertainty estimates with the correctness of generated text, with question-answering (QA) datasets serving as the standard benchmark.
However, evaluating correctness in QA tasks is inherently challenging and can distort the perceived effectiveness of uncertainty estimation methods.
Our results show substantial disagreement between correctness functions; consequently, the ranking of uncertainty estimation methods is significantly influenced by the choice of correctness function, which makes it possible to inflate the reported performance of uncertainty estimation methods.
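As a concrete illustration of this evaluation protocol, the following minimal sketch (not the paper's code; all arrays, method names, and the 20% disagreement rate are hypothetical) scores two uncertainty estimation methods by how well their estimates predict incorrectness, using AUROC under two different correctness functions:

```python
# Minimal, hypothetical sketch of the standard evaluation protocol:
# score how well uncertainty predicts incorrectness (via AUROC) under
# two different correctness functions that partially disagree.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 200

# Hypothetical uncertainty estimates from two methods on the same generations.
uncertainty_a = rng.random(n)
uncertainty_b = rng.random(n)

# Two hypothetical correctness functions (e.g. lexical match vs. LLM judge)
# that disagree on a fraction of the examples.
correct_lexical = rng.integers(0, 2, n)
disagree = rng.random(n) < 0.2
correct_judge = np.where(disagree, 1 - correct_lexical, correct_lexical)

for name, correct in [("lexical", correct_lexical), ("judge", correct_judge)]:
    # Higher uncertainty should indicate incorrectness, so the positive class is 1 - correct.
    print(name,
          round(roc_auc_score(1 - correct, uncertainty_a), 3),
          round(roc_auc_score(1 - correct, uncertainty_b), 3))
```

Because each correctness function labels a different subset of generations as wrong, the AUROC values (and hence the method ranking) can shift depending on which one is used.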
We propose several improvements to overcome these pitfalls.
For QA tasks, we show that averaging over multiple LLM-as-a-judge variants leads to more reliable results.
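A hypothetical sketch of such averaging (judge names and scores are invented for illustration, not taken from the paper) combines per-example correctness scores from several judge variants into one aggregated label before the uncertainty metrics are computed:

```python
# Hypothetical sketch: average several LLM-as-a-judge variants
# (different prompts / judge models) into one correctness label per example.
import numpy as np

judge_scores = {                        # per-example correctness scores in [0, 1]
    "judge_prompt_v1": np.array([1.0, 0.0, 1.0, 1.0]),
    "judge_prompt_v2": np.array([1.0, 0.0, 0.0, 1.0]),
    "judge_model_b":   np.array([1.0, 1.0, 0.0, 1.0]),
}

# Average over judge variants, then threshold to a single correctness label.
mean_scores = np.mean(list(judge_scores.values()), axis=0)
correctness = (mean_scores >= 0.5).astype(int)
print(mean_scores, correctness)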
Furthermore, we explore structured tasks, which provide unambiguous correctness functions.
Finally, we propose an Elo rating of uncertainty estimation methods to provide an objective summary across extensive evaluation settings.
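One possible reading of the Elo proposal (an assumed sketch, not the paper's implementation; method names and scores are hypothetical) treats each evaluation setting as a set of pairwise matches between uncertainty estimation methods, where the method with the better metric wins and ratings are updated with the standard Elo rule:

```python
# Hypothetical Elo-style rating over evaluation settings: in each setting,
# the method with the higher metric (e.g. AUROC) "wins" each pairwise match.
from collections import defaultdict
from itertools import combinations

K = 32  # Elo update step size

def expected(r_a, r_b):
    # Expected score of a against b under the Elo model.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings, a, b, score_a):
    """score_a: 1 if a wins, 0 if b wins, 0.5 for a tie."""
    e_a = expected(ratings[a], ratings[b])
    ratings[a] += K * (score_a - e_a)
    ratings[b] += K * ((1 - score_a) - (1 - e_a))

# Hypothetical metric values per (dataset, correctness function) setting.
results = {
    "setting_1": {"semantic_entropy": 0.78, "perplexity": 0.70, "p_true": 0.74},
    "setting_2": {"semantic_entropy": 0.66, "perplexity": 0.69, "p_true": 0.65},
}

ratings = defaultdict(lambda: 1000.0)
for setting, scores in results.items():
    for a, b in combinations(scores, 2):
        score_a = 0.5 if scores[a] == scores[b] else float(scores[a] > scores[b])
        update(ratings, a, b, score_a)

print(dict(ratings))
```

Using pairwise wins rather than raw metric values abstracts away scale differences between datasets and correctness functions, so results from many heterogeneous settings can be summarized in a single ranking.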
Submission Number: 26