FACT-BENCH: A Holistic Evaluation of LLMs on Factual Knowledge Recall

ACL ARR 2025 May Submission 4811 Authors

20 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: We assess LLMs' ability to recall factual knowledge acquired during pretraining, and investigate the factors that influence this capability. To that end, we construct FACT-BENCH, a benchmark designed with three key attributes. First, FACT-BENCH consists of questions with simple, unambiguous answers that remain stable over time, enabling reliable and straightforward evaluation. Second, it covers 20 domains, 134 property types, and a range of answer types and knowledge popularity levels. Third, FACT-BENCH is programmatically extensible to cover additional factual knowledge of interest from Wikipedia without human annotation. We evaluate 24 models across six families, focusing on three aspects of factual knowledge recall. First, we find that instruction-tuning consistently impairs knowledge recall: models trained only with pretraining outperform their instruction-tuned counterparts. Second, we examine the impact of in-context exemplars using counterfactual demonstrations. These exemplars significantly degrade factual recall, particularly when they contradict knowledge the model already possesses. By further decoupling a model's known and unknown knowledge (that is, whether the model can correctly recall a fact), we find the degradation is driven by exemplars that contradict the model's known knowledge, and worsens as the number of such exemplars grows. Third, we fine-tune Llama-3.1-8B under varying conditions of known and unknown knowledge. Fine-tuning on known knowledge proves consistently more effective than fine-tuning on unknown or mixed knowledge.
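The abstract describes the benchmark as programmatically extensible from Wikipedia without human annotation. The sketch below is not the authors' pipeline; it only illustrates one plausible reading, assuming facts are available as (subject, property, object) triples (e.g., exported from Wikidata or Wikipedia infoboxes) and that each property type has a question template. All names, templates, and examples here are hypothetical.

```python
# Illustrative sketch only -- not the paper's actual construction code.
# Assumes triples with short, unambiguous, time-stable object answers.
from dataclasses import dataclass


@dataclass
class Fact:
    subject: str   # e.g., "Marie Curie"
    prop: str      # e.g., "place of birth"
    obj: str       # e.g., "Warsaw"


# Hypothetical question templates, one per property type.
TEMPLATES = {
    "place of birth": "Where was {subject} born?",
    "capital": "What is the capital of {subject}?",
}


def to_qa(fact: Fact) -> dict | None:
    """Turn a triple into a QA item if a template exists for its property."""
    template = TEMPLATES.get(fact.prop)
    if template is None:
        return None  # skip properties we have no template for
    return {
        "question": template.format(subject=fact.subject),
        "answer": fact.obj,
        "property": fact.prop,
    }


if __name__ == "__main__":
    facts = [
        Fact("Marie Curie", "place of birth", "Warsaw"),
        Fact("Japan", "capital", "Tokyo"),
    ]
    bench = [qa for f in facts if (qa := to_qa(f)) is not None]
    for item in bench:
        print(item)
```

Extending coverage would then amount to adding triples and templates for new property types, which is consistent with the claim that no human annotation of individual questions is needed.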
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Question Answering, Interpretability and Analysis of Models for NLP
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 4811