Identifying the Achilles' Heel: An Iterative Method for Uncovering Factual Errors in Large Language Models

ACL ARR 2025 July Submission805 Authors

28 Jul 2025 (modified: 21 Aug 2025) · ACL ARR 2025 July Submission · CC BY 4.0
Abstract: Large Language Models (LLMs) like ChatGPT are foundational in various applications due to their extensive knowledge from pre-training and fine-tuning. Despite this, they are prone to generating factual and commonsense errors, raising concerns that they may mislead users in critical areas like healthcare, journalism, and education. Current methods for evaluating LLMs' veracity are limited by the need for extensive human labor, test data contamination, or limited scope, hindering efficient and effective error detection. To tackle this problem, we introduce a novel testing framework, FactChecker, aimed at uncovering factual inaccuracies in LLMs automatically and comprehensively. Our extensive tests on nine prominent LLMs, including Gemini-2.0, Claude-Haiku-3.5, Claude-Sonnet-4.0, GPT-3.5-turbo, GPT-4-turbo, GPT-4o, DeepSeek-v3, Qwen-3, and Qwen-3-reasoning, reveal that FactChecker can trigger factual errors in up to 55% of the questions posed to these models. Moreover, we demonstrate that FactChecker's test cases can amplify the detection of factual errors across LLMs. All code, data, and results will be released for future research.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: automatic creation and evaluation of language resources, evaluation
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English
Previous URL: https://openreview.net/forum?id=c8yzu11Rsi
Explanation Of Revisions PDF: pdf
Reassignment Request Area Chair: No, I want the same area chair from our previous submission (subject to their availability).
Reassignment Request Reviewers: No, I want the same set of reviewers from our previous submission (subject to their availability).
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: N/A
A2 Elaboration: There are no potential risks.
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: Reference Section
B2 Discuss The License For Artifacts: N/A
B3 Artifact Use Consistent With Intended Use: N/A
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B5 Documentation Of Artifacts: N/A
B6 Statistics For Data: N/A
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: Section 3.1
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: Section 3.1
C3 Descriptive Statistics: No
C3 Elaboration: Computational resources were insufficient to run the repeated trials needed for descriptive statistics.
C4 Parameters For Packages: Yes
C4 Elaboration: Section 2.4
D Human Subjects Including Annotators: Yes
D1 Instructions Given To Participants: Yes
D1 Elaboration: Appendix G
D2 Recruitment And Payment: N/A
D3 Data Consent: Yes
D3 Elaboration: Appendix G
D4 Ethics Review Board Approval: Yes
D4 Elaboration: Appendix H
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: No
E1 Information About Use Of Ai Assistants: N/A
Author Submission Checklist: Yes
Submission Number: 805