Abstract: Large language models (LLMs) have demonstrated strong capabilities in encoding and applying factual knowledge, much of which follows a one-to-many (1-to-N) structure, where a single query corresponds to multiple valid answers.
However, existing metrics for evaluating 1-to-N knowledge suffer from inherent limitations, such as ignoring valid alternative answers, failing to reflect model confidence, or neglecting probability distributions.
To address these limitations, we propose a new metric, named N-Answer Kullback-Leibler Divergence (NKL), which aligns the predicted probability distribution of an LLM with a given gold distribution (e.g., one derived from a pre-training corpus). NKL integrates both ranking and probability information, offering a more comprehensive evaluation.
We also formalise 1-to-N knowledge evaluation with two criteria—coverage and alignment—under which NKL demonstrates the best overall performance. Experiments on Counterfact and SNOMED CT further validate NKL’s effectiveness in knowledge probing and editing, providing new insights into LLMs’ ability to represent and modify 1-to-N knowledge.
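Since the abstract does not spell out the exact formulation of NKL, the following is a minimal sketch of the underlying idea, assuming NKL reduces to a KL divergence between a gold distribution over the N valid answers and the model's predicted probabilities renormalised onto that answer set; the function and variable names (`n_answer_kl`, `gold_probs`, `model_probs`) are illustrative, not the paper's implementation.

```python
import math

def n_answer_kl(gold_probs, model_probs, eps=1e-12):
    """Sketch of an NKL-style score: KL(gold || model) over the N valid answers,
    after renormalising the model's probability mass onto that answer set."""
    z = sum(model_probs)
    model = [max(p / z, eps) for p in model_probs]  # restrict and renormalise
    gold = [max(p, eps) for p in gold_probs]        # clip zeros for log stability
    return sum(g * math.log(g / m) for g, m in zip(gold, model))

# Example: a 1-to-N query with three valid answers; the gold distribution could
# be estimated from answer frequencies in a pre-training corpus (assumption).
gold = [0.6, 0.3, 0.1]   # assumed corpus-derived answer frequencies
pred = [0.2, 0.5, 0.3]   # model probabilities assigned to the same answers
print(f"NKL-style divergence: {n_answer_kl(gold, pred):.4f}")
```

A score of zero would indicate that the model's distribution over the valid answers matches the gold distribution exactly, with larger values indicating greater misalignment.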
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: 1-to-N knowledge, evaluation metric, knowledge probing, knowledge editing
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data analysis
Languages Studied: English
Submission Number: 1404