BELIEFs: Bias-resilient, Multifaceted Evaluation of Language Models in Factual Knowledge Understanding
Abstract: Fill-in-the-blank prompts are widely used to evaluate how well pre-trained language models (PLMs) capture real-world factual knowledge.
However, prompt-based evaluation results vary significantly with the linguistic expression of the prompt, even for the same fact.
To assess PLMs' factual understanding more fairly, we introduce a new dataset, MyriadLAMA, along with the evaluation benchmark BELIEF and its variant BELIEF-ICL, which target encoder- and decoder-based PLMs, respectively.
MyriadLAMA provides diverse fill-in-the-blank prompts for the same fact; the BELIEFs leverage this diversity not only to mitigate prompt bias during factual knowledge probing by consolidating results from multiple prompts, but also to offer a comprehensive evaluation of factual knowledge in PLMs, covering accuracy, consistency, and reliability.
We validate the efficacy of the BELIEFs through extensive evaluations of encoder- and decoder-based PLMs.
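The consolidation idea described above — querying the same fact with several paraphrased prompts and aggregating the per-prompt predictions — can be sketched as follows. The prompts, predictions, and majority-vote rule here are illustrative assumptions for exposition, not the paper's exact aggregation method:

```python
from collections import Counter

def consolidate(predictions_per_prompt):
    """Aggregate one model prediction per paraphrased prompt by majority vote.

    Returns the consolidated answer and a simple consistency score:
    the fraction of prompts that agree with that answer.
    """
    counts = Counter(predictions_per_prompt)
    answer, _ = counts.most_common(1)[0]
    consistency = counts[answer] / len(predictions_per_prompt)
    return answer, consistency

# Hypothetical predictions for paraphrases of "The capital of France is [MASK]."
preds = ["Paris", "Paris", "Lyon", "Paris"]
answer, consistency = consolidate(preds)
# answer == "Paris", consistency == 0.75
```

Aggregating over paraphrases in this way makes the probe less sensitive to any single prompt's wording, which is the bias the abstract refers to.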
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Interpretability and Analysis of Models for NLP, Information Extraction, Language Modeling
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English
Submission Number: 793