BELIEFs: Bias-resilient, Multifaceted Evaluation of Language Models in Factual Knowledge Understanding

ACL ARR 2024 April Submission793 Authors

16 Apr 2024 (modified: 23 May 2024) · ACL ARR 2024 April Submission · CC BY 4.0
Abstract: Fill-in-the-blank prompts are widely used to evaluate how well pre-trained language models (PLMs) capture real-world factual knowledge. However, prompt-based evaluation results vary significantly with the linguistic expression of the prompt, even for the same fact. To assess PLMs' factual understanding more fairly, we introduce a new dataset, MyriadLAMA, together with the evaluation benchmark BELIEF and its variant BELIEF-ICL, designed for encoder-based and decoder-based PLMs, respectively. MyriadLAMA provides diverse fill-in-the-blank prompts for each fact; BELIEFs leverage this diversity to mitigate prompt bias during factual knowledge probing by consolidating results from multiple prompts, and to offer a comprehensive evaluation of factual knowledge in PLMs covering accuracy, consistency, and reliability. We validate the efficacy of the BELIEFs through comprehensive evaluations of encoder-based and decoder-based PLMs.
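The multi-prompt consolidation described in the abstract can be sketched as follows. This is a minimal illustration with hypothetical predictions and a simple majority-vote aggregator plus a pairwise-agreement consistency score; it is not the paper's actual BELIEF scoring procedure.

```python
from collections import Counter

# Hypothetical example: one fact probed with several paraphrased
# fill-in-the-blank prompts, as in MyriadLAMA. The values stand in
# for a PLM's top-1 answer to each prompt.
prompt_predictions = {
    "The capital of France is [MASK].": "Paris",
    "France's capital city is [MASK].": "Paris",
    "[MASK] is the capital of France.": "Lyon",  # prompt-induced error
}

def aggregate(predictions):
    """Consolidate answers across prompts by majority vote,
    reducing the bias of any single linguistic expression."""
    counts = Counter(predictions.values())
    answer, _ = counts.most_common(1)[0]
    return answer

def consistency(predictions):
    """Fraction of prompt pairs that agree: one simple way
    to quantify sensitivity to prompt wording."""
    preds = list(predictions.values())
    pairs = [(a, b) for i, a in enumerate(preds) for b in preds[i + 1:]]
    return sum(a == b for a, b in pairs) / len(pairs)

print(aggregate(prompt_predictions))    # "Paris"
print(consistency(prompt_predictions))  # 1/3 of pairs agree
```

Aggregating over paraphrases makes the accuracy estimate less dependent on any single phrasing, while the consistency score exposes how much the model's answer shifts with wording.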
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Interpretability and Analysis of Models for NLP, Information Extraction, Language Modeling
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English
Submission Number: 793