Abstract: In the deployment of Large Language Models (LLMs), ``spurious correctness''---where answers are correct but the reasoning contains errors---poses a critical risk by creating an illusion of reliability. While prior work on LLM confidence estimation focuses on answer-level confidence or confidence over the entire reasoning path, these coarse-grained approaches fail to identify which specific parts of the reasoning contain errors. We propose a fine-grained confidence estimation framework that computes confidence scores for individual evidence triplets within reasoning chains, enabling precise localization of errors. Using specially designed prompts, our method generates answers, evidence in triplet format, and their respective confidence scores simultaneously, allowing automatic detection of spurious-correctness patterns in which part of the evidence contains factual errors. Evaluated on Japanese multi-hop QA across three model families representing different architectures and training approaches, our approach exhibits superior calibration of evidence confidence and a strong ability to detect spuriously correct answers (up to 84% discrimination accuracy). As a secondary benefit, joint generation of confidence scores improves answer confidence calibration by up to 43%. This prompt-based approach requires no model retraining and is immediately applicable to existing LLMs.
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Interpretability and Analysis of Models for NLP, Question Answering
Contribution Types: Model analysis & interpretability
Languages Studied: Japanese
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: N/A
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: Section 4.1 cites JEMHopQA dataset (Ishii et al., 2024). Models used (GPT-4.1-mini, Llama-4-Maverick, Phi-4) are described in Section 4.2 with version information and access methods. Where available model papers exist, they are cited; for models without formal publications, we provide version details and API endpoints for reproducibility.
B2 Discuss The License For Artifacts: No
B2 Elaboration: We use publicly available models via commercial API services (Azure AI Foundry) under their terms of service. JEMHopQA dataset is used for research purposes consistent with its intended use. Specific licenses were not discussed as we do not redistribute any artifacts.
B3 Artifact Use Consistent With Intended Use: Yes
B3 Elaboration: Section 4.1 states we use JEMHopQA for evaluation, consistent with its intended research purpose. Models are accessed via official APIs for research use as permitted by their terms of service.
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B5 Documentation Of Artifacts: Yes
B5 Elaboration: Section 4.1 describes JEMHopQA as a Japanese multi-hop QA dataset. Section 4.2 details model architectures, training paradigms, and parameter counts where available.
B6 Statistics For Data: Yes
B6 Elaboration: Section 4.1 reports using 1,000 questions from the JEMHopQA training split for evaluation and 3 for few-shot examples, with an average of 2.2 evidence triplets per question.
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: Section 4.2 reports parameter counts for available models (Llama-4: 17B, Phi-4: 14B). The parameter count of GPT-4.1-mini is not publicly available. Computational costs are not reported, as we used API services rather than direct computation.
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: Section 4.2 specifies temperature settings (0.0 for most methods, 0.7 with top-p 0.95 for Label prob.), model versions, and API access details.
C3 Descriptive Statistics: Yes
C3 Elaboration: All tables report results on the full evaluation set (1,000 instances). Section 4.3 describes temperature scaling via 5-fold cross-validation. No error bars are reported, as results are from single runs on the complete dataset.
C4 Parameters For Packages: Yes
C4 Elaboration: A footnote in Section 3.2 mentions RapidFuzz for fuzzy string matching with a similarity threshold. Section 4.3 describes temperature scaling implementation details.
D Human Subjects Including Annotators: Yes
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: Yes
E1 Elaboration: AI assistants were used for grammar checking and code formatting assistance. Due to anonymous submission requirements, detailed acknowledgment will be provided in the camera-ready version.
Author Submission Checklist: Yes
Submission Number: 1023