Identifying the Achilles' Heel: An Iterative Method for Uncovering Factual Errors in Large Language Models

ACL ARR 2025 July Submission805 Authors

28 Jul 2025 (modified: 21 Aug 2025) · ACL ARR 2025 July Submission · CC BY 4.0
Abstract: Large Language Models (LLMs) like ChatGPT are foundational in various applications due to their extensive knowledge from pre-training and fine-tuning. Despite this, they are prone to generating factual and commonsense errors, raising concerns that they may mislead users in critical areas like healthcare, journalism, and education. Current methods for evaluating LLMs' veracity are limited by the need for extensive human labor, test data contamination, or limited scope, hindering efficient and effective error detection. To tackle this problem, we introduce a novel testing framework, FactChecker, aimed at uncovering factual inaccuracies in LLMs automatically and comprehensively. Our extensive tests on nine prominent LLMs, including Gemini-2.0, Claude-Haiku-3.5, Claude-Sonnet-4.0, GPT-3.5-turbo, GPT-4-turbo, GPT-4o, DeepSeek-v3, Qwen-3, and Qwen-3-reasoning, reveal that FactChecker can trigger factual errors in up to 55% of the questions posed to these models. Moreover, we demonstrate that FactChecker's test cases can amplify the detection of factual errors across LLMs. All code, data, and results will be released for future research.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: automatic creation and evaluation of language resources, evaluation
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English
Previous URL: https://openreview.net/forum?id=c8yzu11Rsi
Explanation Of Revisions PDF: pdf
Reassignment Request Area Chair: No, I want the same area chair from our previous submission (subject to their availability).
Reassignment Request Reviewers: No, I want the same set of reviewers from our previous submission (subject to their availability).
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: N/A
A2 Elaboration: There are no potential risks.
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: Reference Section
B2 Discuss The License For Artifacts: N/A
B3 Artifact Use Consistent With Intended Use: N/A
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B5 Documentation Of Artifacts: N/A
B6 Statistics For Data: N/A
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: Section 3.1
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: Section 3.1
C3 Descriptive Statistics: No
C3 Elaboration: Computational resources were insufficient to run the repeated trials needed for descriptive statistics.
C4 Parameters For Packages: Yes
C4 Elaboration: Section 2.4
D Human Subjects Including Annotators: Yes
D1 Instructions Given To Participants: Yes
D1 Elaboration: Appendix G
D2 Recruitment And Payment: N/A
D3 Data Consent: Yes
D3 Elaboration: Appendix G
D4 Ethics Review Board Approval: Yes
D4 Elaboration: Appendix H
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: No
E1 Information About Use Of Ai Assistants: N/A
Author Submission Checklist: Yes
Submission Number: 805