Abstract: Uncertainty quantification (UQ) has emerged as an effective approach to closed-book hallucination detection for LLMs, but existing methods are largely designed for short-form outputs and do not generalize well to long-form generation. We introduce a taxonomy for fine-grained uncertainty quantification in long-form LLM outputs that distinguishes methods by design choices at three stages: response decomposition, unit-level scoring, and response-level aggregation. We formalize several families of consistency-based black-box scorers, providing generalizations and extensions of existing methods. We also introduce FactScore-STEM-Geo, a new 400-question long-form QA dataset spanning four categories across STEM and Geography. In our experiments across multiple LLMs and datasets, we find that (1) claim-response entailment consistently performs on par with or better than more complex claim-level scorers, (2) claim-level scoring generally yields better results than sentence-level scoring, and (3) uncertainty-aware decoding is highly effective at improving the factuality of long-form outputs. Our framework clarifies relationships between prior methods, enables apples-to-apples comparisons, and provides practical guidance for selecting components for fine-grained UQ.
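For concreteness, the following is a minimal sketch of one instantiation of the three-stage pipeline described in the abstract (claim decomposition is assumed to have already been performed). The `entails` judge, the function names, and the toy substring check are illustrative assumptions, not the paper's implementation.

```python
from typing import Callable, List

def claim_response_entailment_scores(
    claims: List[str],
    sampled_responses: List[str],
    entails: Callable[[str, str], bool],
) -> List[float]:
    """Score each claim by the fraction of independently sampled
    responses that entail it (higher = more consistent, more confident)."""
    return [
        sum(entails(resp, claim) for resp in sampled_responses) / len(sampled_responses)
        for claim in claims
    ]

def response_level_score(claim_scores: List[float]) -> float:
    """Aggregate claim-level scores into a single response-level score
    by simple averaging."""
    return sum(claim_scores) / len(claim_scores)

if __name__ == "__main__":
    # Toy stand-in for an NLI model or LLM judge: substring containment.
    toy_entails = lambda premise, hypothesis: hypothesis.lower() in premise.lower()
    claims = ["Paris is the capital of France.", "Paris has 20 million residents."]
    samples = [
        "Paris is the capital of France. It has about 2 million residents.",
        "France's capital is Paris.",
    ]
    scores = claim_response_entailment_scores(claims, samples, toy_entails)
    print(scores, response_level_score(scores))
```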
Submission Type: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: # Changes in Revision 1 (v2)
We thank all reviewers for their constructive feedback. Below we summarize the changes made in this revision.
## Expanded Experimental Scope
- **Added Llama-4-Maverick-17B** as a fifth LLM (open-weight, non-thinking model from Meta). Results are integrated into all main-text tables and figures (Tables 3–4, Figures 5–6) and all applicable appendix tables and figures. Results are consistent with the existing four models, confirming all headline findings.
- **Added white-box baselines** (average token log-probabilities) to the response-level analysis for the four core models (the Gemini-2.5 and GPT-4o families) in Tables 4 and 19–21.
- **Added response-level correlation results for FactScore-STEM-Geo** in Tables 20–21 (Appendix).
## New Analyses
- **Self-preference bias check (Appendix A.1):** We used GPT-4o as an alternative grader on a stratified sample (n=400 per generator-granularity combination). Cohen's κ ranges from 0.63 to 0.84, and agreement levels are consistent regardless of whether the generator matches the original grader.
- **Manual evaluation of claim decomposition (Appendix A.2):** Two authors evaluated 410 claims across 15 responses for faithfulness, standalone interpretability, and coverage. Results show 99.8% faithfulness, 96.8% standalone quality, and ~99.3% recall.
- **Threshold transfer analysis (Appendix A.3):** We investigate how well decision thresholds generalize across datasets and scorers. Cross-dataset transfer incurs minimal performance loss (F1 gaps up to 0.028 for claim-level, 0.049 for sentence-level), while cross-scorer transfer performs substantially worse, particularly for sentence-level scorers.
- **Alternative aggregation strategies (Section 4.4):** We evaluated minimum, geometric-mean, and rank-weighted aggregation as alternatives to simple averaging. Simple averaging consistently matched or outperformed the alternatives, while minimum aggregation performed notably worse. Given the already extensive appendix tables, we report these results as discussion only (see the sketch after this list).
- **Claim diversity analysis for thinking models (Table 22):** We report the claim diversity ratio per model to assess whether thinking models produce less diverse outputs.
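As a reference point for the aggregation comparison above, here is a minimal sketch of the four strategies applied to claim-level scores. The exact rank-weighting scheme is not specified in these notes, so the 1/rank weighting below is an illustrative assumption rather than the paper's definition.

```python
import math
from typing import List

def aggregate_mean(scores: List[float]) -> float:
    """Simple averaging (the baseline discussed above)."""
    return sum(scores) / len(scores)

def aggregate_min(scores: List[float]) -> float:
    """Response score determined by the single least-confident claim."""
    return min(scores)

def aggregate_geometric_mean(scores: List[float], eps: float = 1e-8) -> float:
    """Geometric mean; eps guards against log(0) for zero-scored claims."""
    return math.exp(sum(math.log(max(s, eps)) for s in scores) / len(scores))

def aggregate_rank_weighted(scores: List[float]) -> float:
    """Hypothetical rank weighting: the i-th lowest score gets weight 1/i,
    so low-confidence claims dominate without fully determining the score."""
    ordered = sorted(scores)
    weights = [1.0 / (i + 1) for i in range(len(ordered))]
    return sum(w * s for w, s in zip(weights, ordered)) / sum(weights)

if __name__ == "__main__":
    claim_scores = [0.9, 0.8, 0.75, 0.2]
    for fn in (aggregate_mean, aggregate_min,
               aggregate_geometric_mean, aggregate_rank_weighted):
        print(fn.__name__, round(fn(claim_scores), 3))
```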
## Clarifications and Additional Content
- **Clarified that each scorer family contains multiple methods** in Sections 1 and 3, with footnotes distinguishing target LLMs from auxiliary LLMs used for decomposition, question generation, and claim merging.
- **Added Table 2** mapping each scorer to its prior-work origin (exact method, generalization, or new contribution).
- **Expanded discussion of claim-QA's limitations** in Section 5, explaining the inherent tension between atomic claims and claim-QA methods.
- **Added concrete cost analysis** for matched-claim scoring in footnote 9 (~25× the cost of matched-sentence scoring).
- **Added explicit definitions** of LLM accuracy/FactScore metrics in Section 4.1.
- **Mentioned FactScore-STEM-Geo** in the abstract and introduction.
- **Added discussion of selection bias** in FactScore-STEM-Geo (Section 7), noting that selecting longest Wikipedia articles may bias toward well-documented entities.
- **Added Broader Impact discussion** (Section 5) addressing risks of UAD deployment and reduced user visibility into model uncertainty.
- **Added discussion of thinking models** in Section 7, noting that all scorer families apply without modification to outputs generated under thinking budgets, and that a targeted study of reasoning models remains future work.
# Changes in Revision 2 (v3)
Added descriptions to Appendix B subsections (B.1–B.4) clarifying what each set of supplemental results demonstrates and linking back to the corresponding main-text sections, per suggestions from Reviewer eJQG.
# Changes in Revision 3 (v4)
Updated references to cite published versions rather than preprints wherever possible, per suggestions from Reviewer 9sfN.
Assigned Action Editor: ~Polina_Kirichenko1
Submission Number: 7590