GPTKB: Comprehensively Materializing Factual LLM Knowledge

ACL ARR 2024 December Submission 7 Authors

02 Dec 2024 (modified: 05 Feb 2025) · ACL ARR 2024 December Submission · CC BY 4.0
Abstract: LLMs have significantly advanced NLP and AI, and next to their ability to perform a wide range of procedural tasks, a major success factor is their internalized factual knowledge. Since Petroni et al. (2019), analyzing this knowledge has gained attention. However, most approaches investigate one question at a time via modest-sized pre-defined samples, introducing an *availability bias* (Tversky and Kahneman, 1973) that prevents the discovery of knowledge (or beliefs) of LLMs beyond the experimenter's predisposition. To address this challenge, we propose a novel methodology for comprehensively materializing an LLM's factual knowledge through recursive querying and result consolidation. As a prototype, we employ GPT-4o-mini to construct GPTKB, a large-scale knowledge base (KB) comprising 105 million triples for over 2.9 million entities, achieved at 1% of the cost of previous KB projects. This work marks a milestone in two areas: (1) for LLM research, it provides, for the first time, *constructive* insights into the scope and structure of LLMs' knowledge (or beliefs); (2) for KB construction, it pioneers *new pathways* for the long-standing challenge of general-domain KB construction. GPTKB is accessible at <anonymized>.
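The recursive querying and consolidation idea from the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `llm` callable, the hypothetical `elicit_triples` helper, the prompt wording, and the breadth-first expansion policy are all assumptions made for illustration.

```python
from collections import deque

def elicit_triples(llm, entity):
    """Hypothetical helper: prompt the LLM for (subject, predicate, object)
    triples about `entity`. The return format (a list of 3-tuples) is assumed."""
    prompt = f"List factual (subject, predicate, object) triples about {entity}."
    return llm(prompt)

def materialize_kb(llm, seed_entity, max_entities=1000):
    """Breadth-first recursive expansion: query the LLM about a seed entity,
    then about every newly mentioned entity, consolidating triples into a set."""
    kb = set()                    # consolidated triples (duplicates removed)
    seen = {seed_entity}          # entities already queued for querying
    queue = deque([seed_entity])
    while queue and len(seen) < max_entities:
        entity = queue.popleft()
        for s, p, o in elicit_triples(llm, entity):
            kb.add((s, p, o))
            # Treat string-valued objects as candidate entities to expand next.
            if isinstance(o, str) and o not in seen:
                seen.add(o)
                queue.append(o)
    return kb
```

In this sketch, deduplicating triples in a set stands in for the paper's result consolidation step; the actual system would also need entity canonicalization and literal/entity disambiguation.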
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: model bias, open information extraction, knowledge base construction, knowledge tracing/discovering, probing, benchmarking, evaluation
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
Submission Number: 7