Abstract: Uncertainty quantification (UQ) has emerged as an effective approach to closed-book hallucination detection for LLMs, but existing methods are largely designed for short-form outputs and do not generalize well to long-form generation. We introduce a taxonomy for fine-grained uncertainty quantification in long-form LLM outputs that distinguishes methods by design choices at three stages: response decomposition, unit-level scoring, and response-level aggregation. We formalize several families of consistency-based black-box scorers, providing generalizations and extensions of existing methods. We also introduce FactScore-STEM-Geo, a new 400-question long-form QA dataset spanning four categories across STEM and Geography. In our experiments across multiple LLMs and datasets, we find that (1) claim-response entailment consistently performs on par with or better than more complex claim-level scorers, (2) claim-level scoring generally yields better results than sentence-level scoring, and (3) uncertainty-aware decoding is highly effective at improving the factuality of long-form outputs. Our framework clarifies relationships between prior methods, enables apples-to-apples comparisons, and provides practical guidance for selecting components for fine-grained UQ.
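For concreteness, the following is a minimal sketch of one instantiation of the three-stage pipeline described in the abstract (claim decomposition is assumed to have already been performed). The `entails` judge, the function names, and the toy substring check are illustrative assumptions, not the paper's implementation.

```python
from typing import Callable, List

def claim_response_entailment_scores(
    claims: List[str],
    sampled_responses: List[str],
    entails: Callable[[str, str], bool],
) -> List[float]:
    """Score each claim by the fraction of independently sampled
    responses that entail it (higher = more consistent, more confident)."""
    return [
        sum(entails(resp, claim) for resp in sampled_responses) / len(sampled_responses)
        for claim in claims
    ]

def response_level_score(claim_scores: List[float]) -> float:
    """Aggregate claim-level scores into a single response-level score
    by simple averaging."""
    return sum(claim_scores) / len(claim_scores)

if __name__ == "__main__":
    # Toy stand-in for an NLI model or LLM judge: substring containment.
    toy_entails = lambda premise, hypothesis: hypothesis.lower() in premise.lower()
    claims = ["Paris is the capital of France.", "Paris has 20 million residents."]
    samples = [
        "Paris is the capital of France. It has about 2 million residents.",
        "France's capital is Paris.",
    ]
    scores = claim_response_entailment_scores(claims, samples, toy_entails)
    print(scores, response_level_score(scores))
```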
Submission Type: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: # Changes in Revision 1 (v2)
We thank all reviewers for their constructive feedback. Below we summarize the changes made in this revision.
## Expanded Experimental Scope
- **Added Llama-4-Maverick-17B** as a fifth LLM (open-weight, non-thinking model from Meta). Results are integrated into all main-text tables and figures (Tables 3–4, Figures 5–6) and all applicable appendix tables and figures. Results are consistent with the existing four models, confirming all headline findings.
- **Added white-box baselines** (average token log-probabilities) to the response-level analysis for the four core models (the Gemini-2.5 and GPT-4o families) in Tables 4 and 19–21.
- **Added response-level correlation results for FactScore-STEM-Geo** in Tables 20–21 (Appendix).
## New Analyses
- **Self-preference bias check (Appendix A.1):** We used GPT-4o as an alternative grader on a stratified sample (n=400 per generator-granularity combination). Cohen's κ ranges from 0.63 to 0.84, and agreement levels are consistent regardless of whether the generator matches the original grader.
- **Manual evaluation of claim decomposition (Appendix A.2):** Two authors evaluated 410 claims across 15 responses for faithfulness, standalone interpretability, and coverage. Results show 99.8% faithfulness, 96.8% standalone quality, and ~99.3% recall.
- **Threshold transfer analysis (Appendix A.3):** We investigate how well decision thresholds generalize across datasets and scorers. Cross-dataset transfer incurs minimal performance loss (F1 gaps up to 0.028 for claim-level, 0.049 for sentence-level), while cross-scorer transfer performs substantially worse, particularly for sentence-level scorers.
- **Alternative aggregation strategies (Section 4.4):** We evaluated minimum, geometric-mean, and rank-weighted aggregation as alternatives to simple averaging. Simple averaging consistently matched or outperformed the alternatives, while minimum aggregation performed notably worse. Given the already extensive appendix tables, we report these results as discussion only (see the sketch after this list).
- **Claim diversity analysis for thinking models (Table 22):** We report the claim diversity ratio per model to assess whether thinking models produce less diverse outputs.
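As a reference point for the aggregation comparison above, here is a minimal sketch of the four strategies applied to claim-level scores. The exact rank-weighting scheme is not specified in these notes, so the 1/rank weighting below is an illustrative assumption rather than the paper's definition.

```python
import math
from typing import List

def aggregate_mean(scores: List[float]) -> float:
    """Simple averaging (the baseline discussed above)."""
    return sum(scores) / len(scores)

def aggregate_min(scores: List[float]) -> float:
    """Response score determined by the single least-confident claim."""
    return min(scores)

def aggregate_geometric_mean(scores: List[float], eps: float = 1e-8) -> float:
    """Geometric mean; eps guards against log(0) for zero-scored claims."""
    return math.exp(sum(math.log(max(s, eps)) for s in scores) / len(scores))

def aggregate_rank_weighted(scores: List[float]) -> float:
    """Hypothetical rank weighting: the i-th lowest score gets weight 1/i,
    so low-confidence claims dominate without fully determining the score."""
    ordered = sorted(scores)
    weights = [1.0 / (i + 1) for i in range(len(ordered))]
    return sum(w * s for w, s in zip(weights, ordered)) / sum(weights)

if __name__ == "__main__":
    claim_scores = [0.9, 0.8, 0.75, 0.2]
    for fn in (aggregate_mean, aggregate_min,
               aggregate_geometric_mean, aggregate_rank_weighted):
        print(fn.__name__, round(fn(claim_scores), 3))
```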
## Clarifications and Additional Content
- **Clarified that each scorer family contains multiple methods** in Sections 1 and 3, with footnotes distinguishing target LLMs from auxiliary LLMs used for decomposition, question generation, and claim merging.
- **Added Table 2** mapping each scorer to its prior-work origin (exact method, generalization, or new contribution).
- **Expanded discussion of claim-QA's limitations** in Section 5, explaining the inherent tension between atomic claims and claim-QA methods.
- **Added concrete cost analysis** for matched-claim scoring in footnote 9 (~25× the cost of matched-sentence scoring).
- **Added explicit definitions** of LLM accuracy/FactScore metrics in Section 4.1.
- **Mentioned FactScore-STEM-Geo** in the abstract and introduction.
- **Added discussion of selection bias** in FactScore-STEM-Geo (Section 7), noting that selecting longest Wikipedia articles may bias toward well-documented entities.
- **Added Broader Impact discussion** (Section 5) addressing risks of UAD deployment and reduced user visibility into model uncertainty.
- **Added discussion of thinking models** in Section 7, noting that all scorer families apply without modification to outputs generated under thinking budgets, and that a targeted study of reasoning models remains future work.
# Changes in Revision 2 (v3)
Added descriptions to Appendix B subsections (B.1–B.4) clarifying what each set of supplemental results demonstrates and linking back to the corresponding main-text sections, per suggestions from Reviewer eJQG.
# Changes in Revision 3 (v4)
Updated references to cite published versions rather than preprints wherever possible, per suggestions from Reviewer 9sfN.
Assigned Action Editor: ~Polina_Kirichenko1
Submission Number: 7590