Assessing Factual Reliability of Large Language Model Knowledge

Anonymous

16 Dec 2023 · ACL ARR 2023 December Blind Submission · Readers: Everyone
TL;DR: A novel distance-based metric that directly computes output probabilities and their changes to address the "accuracy instability" caused by the prompt framing effect and in-context interference.
Abstract: The factual knowledge of LLMs is typically evaluated with accuracy, yet this metric does not capture the vulnerability of LLMs to hallucination-inducing factors such as prompt and context variability. How can we evaluate the ability of LLMs to consistently produce factually correct answers? In this paper, we propose the MOdel kNowledge relIabiliTy scORe (MONITOR), a novel metric designed to directly measure LLMs' factual reliability. MONITOR computes the distance between the probability distributions of a valid output and its counterparts produced by the same LLM when probing the same fact with different styles of prompts and contexts. Experiments on a comprehensive range of 12 LLMs demonstrate the effectiveness of MONITOR in evaluating their factual reliability while maintaining a low computational overhead. In addition, we will release the FKTC (Factual Knowledge Test Corpus) to foster research along this line.
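To make the abstract's description concrete, below is a minimal sketch of a distance-based reliability score in the spirit of MONITOR. The paper's exact distance and aggregation are not reproduced here; this sketch assumes total variation distance over a fixed set of candidate answers, and all function and variable names are illustrative, not from the paper.

```python
import numpy as np

def monitor_score(anchor_probs, variant_probs_list):
    """Average distance between the answer distribution under a reference
    prompt and the distributions under paraphrased prompts / added contexts
    for the same fact. Larger values suggest lower factual reliability.
    (Illustrative sketch only; not the paper's exact formulation.)"""
    anchor = np.asarray(anchor_probs, dtype=float)
    distances = [
        0.5 * np.abs(anchor - np.asarray(v, dtype=float)).sum()  # total variation
        for v in variant_probs_list
    ]
    return float(np.mean(distances))

# Hypothetical example: distributions over three candidate answers for one fact.
anchor = [0.7, 0.2, 0.1]                       # reference prompt
variants = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]  # paraphrase; distracting context
print(monitor_score(anchor, variants))         # averages 0.1 and 0.5 -> 0.3
```

Because the score only requires the model's output probabilities under each prompt variant, it can be computed in a single forward pass per prompt, consistent with the low computational overhead claimed in the abstract.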
Paper Type: long
Research Area: Resources and Evaluation
Contribution Types: Model analysis & interpretability
Languages Studied: English