TPEval: A Novel Truth-Preserving Evaluation Method for Probing LLMs' Professional Factual Knowledge Mastery

Anonymous

16 Feb 2024 · ACL ARR 2024 February Blind Submission · Readers: Everyone
Abstract: Using large language models (LLMs) to solve problems in professional fields (e.g., medicine) is emerging as a research hotspot, requiring LLMs to master sufficient domain-specific factual knowledge. Recently, several LLMs have achieved notable performance on multiple professional-field evaluation benchmarks. However, current benchmarks generally rely on common, fixed question formulations, allowing LLMs to produce correct answers from surface-level patterns in the questions without mastering the underlying knowledge. In this paper, we focus on this problem. We propose a general truth-preserving evaluation framework (TPEval) to precisely probe LLMs' mastery of factual knowledge in professional fields through distinct representations of the same knowledge. Specifically, for each piece of knowledge, we convert its original expression into multiple truth-preserving statements via logical transformations, presenting the knowledge in diverse ways. By leveraging these statements, the proposed framework can more precisely estimate LLMs' mastery of the specified knowledge. Given the wealth of factual knowledge in medicine, we validate the effectiveness of our framework in the medical domain. We curate 6,000+ clinical facts, generate eight statements for each fact using the proposed method, and evaluate LLMs' mastery of them. Experimental results show a notable decline in LLMs' performance as the number of statements per fact increases, suggesting that their knowledge mastery is insufficient. Our method can serve as an effective solution for probing LLMs' knowledge mastery in professional fields.
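To make the evaluation protocol concrete, here is a minimal sketch of how one fact could be expanded into truth-preserving statements and scored with an all-or-nothing criterion. This is not the paper's implementation: the transformation set shown (negation, double negation) is illustrative rather than the paper's eight statements, the triple-shaped fact representation is an assumption, and ask_llm is a hypothetical callable that maps a prompt to the model's one-word true/false answer.

# Minimal sketch of a truth-preserving evaluation loop (illustrative only).
# Assumptions not taken from the paper: facts are (subject, predicate, object)
# triples, only negation-style transformations are shown, and `ask_llm` is a
# hypothetical callable mapping a prompt string to the model's text answer.

from typing import Callable, List, Tuple

def truth_preserving_statements(subj: str, pred: str, obj: str) -> List[Tuple[str, bool]]:
    """Render one fact as (statement, gold_label) pairs.

    Every pair follows logically from the same fact, so a model that truly
    masters the fact should label all of them correctly.
    """
    return [
        (f"{subj} {pred} {obj}.", True),                            # original assertion
        (f"It is not the case that {subj} {pred} {obj}.", False),   # negation
        (f"It is false that {subj} does not {pred} {obj}.", True),  # double negation
    ]

def fact_mastered(ask_llm: Callable[[str], str], subj: str, pred: str, obj: str) -> bool:
    """A fact counts as mastered only if ALL its statements are judged correctly."""
    for statement, gold in truth_preserving_statements(subj, pred, obj):
        prompt = f"True or false: {statement} Answer with one word."
        predicted_true = ask_llm(prompt).strip().lower().startswith("true")
        if predicted_true != gold:
            return False
    return True

def mastery_rate(ask_llm: Callable[[str], str], facts: List[Tuple[str, str, str]]) -> float:
    """Fraction of facts for which the model answers every variant correctly.

    The all-or-nothing criterion is why measured performance declines as the
    number of statements per fact grows: each extra statement is another
    chance to expose a fact the model only pattern-matched.
    """
    if not facts:
        return 0.0
    return sum(fact_mastered(ask_llm, *f) for f in facts) / len(facts)

For instance, with a (hypothetical) fact triple ("Metformin", "is first-line therapy for", "type 2 diabetes"), mastery_rate credits the model only if it labels the assertion, its negation, and its double negation all correctly, which is the sense in which a single memorized phrasing no longer suffices.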
Paper Type: long
Research Area: Resources and Evaluation
Contribution Types: NLP engineering experiment
Languages Studied: English