Abstract: In the deployment of Large Language Models (LLMs), ``spurious correctness''---where answers are correct but the reasoning contains errors---poses a critical risk by creating an illusion of reliability. While prior work on LLM confidence estimation focuses on answer-level confidence or confidence over the entire reasoning path, these coarse-grained approaches fail to identify which specific parts of the reasoning contain errors. We propose a fine-grained confidence estimation framework that computes confidence scores for individual evidence triplets within reasoning chains, enabling precise localization of errors. Using specially designed prompts, our method generates answers, evidence in triplet format, and their respective confidence scores simultaneously, allowing automatic detection of spurious-correctness patterns in which part of the evidence contains factual errors. Evaluated on Japanese multi-hop QA across three model families representing different architectures and training approaches, our approach exhibits superior calibration of evidence confidence and a strong ability to detect spuriously correct answers (up to 84% discrimination accuracy). As a secondary benefit, joint generation of confidence scores improves answer confidence calibration by up to 43%. This prompt-based approach requires no model retraining and is immediately applicable to existing LLMs.
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Interpretability and Analysis of Models for NLP, Question Answering
Contribution Types: Model analysis & interpretability
Languages Studied: Japanese
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: N/A
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: Section 4.1 cites JEMHopQA dataset (Ishii et al., 2024). Models used (GPT-4.1-mini, Llama-4-Maverick, Phi-4) are described in Section 4.2 with version information and access methods. Where available model papers exist, they are cited; for models without formal publications, we provide version details and API endpoints for reproducibility.
B2 Discuss The License For Artifacts: No
B2 Elaboration: We use publicly available models via commercial API services (Azure AI Foundry) under their terms of service. JEMHopQA dataset is used for research purposes consistent with its intended use. Specific licenses were not discussed as we do not redistribute any artifacts.
B3 Artifact Use Consistent With Intended Use: Yes
B3 Elaboration: Section 4.1 states we use JEMHopQA for evaluation, consistent with its intended research purpose. Models are accessed via official APIs for research use as permitted by their terms of service.
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B5 Documentation Of Artifacts: Yes
B5 Elaboration: Section 4.1 describes JEMHopQA as a Japanese multi-hop QA dataset. Section 4.2 details model architectures, training paradigms, and parameter counts where available.
B6 Statistics For Data: Yes
B6 Elaboration: Section 4.1 reports using 1,000 questions from the JEMHopQA training split for evaluation and 3 for few-shot examples, with an average of 2.2 evidence triplets per question.
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: Section 4.2 reports parameter counts for available models (Llama-4: 17B, Phi-4: 14B). The parameter count of GPT-4.1-mini is not publicly available. Computational costs are not reported, as we used API services rather than direct computation.
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: Section 4.2 specifies temperature settings (0.0 for most methods, 0.7 with top-p 0.95 for Label prob.), model versions, and API access details.
C3 Descriptive Statistics: Yes
C3 Elaboration: All tables report results on the full evaluation set (1,000 instances). Section 4.3 describes temperature scaling via 5-fold cross-validation. No error bars are reported, as results are from single runs on the complete dataset.
C4 Parameters For Packages: Yes
C4 Elaboration: A footnote in Section 3.2 mentions RapidFuzz for fuzzy string matching with a similarity threshold. Section 4.3 describes temperature scaling implementation details.
D Human Subjects Including Annotators: Yes
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: Yes
E1 Elaboration: AI assistants were used for grammar checking and code formatting assistance. Due to anonymous submission requirements, detailed acknowledgment will be provided in the camera-ready version.
Author Submission Checklist: Yes
Submission Number: 1023