Abstract: Large language models (LLMs) often suffer from hallucinations, posing significant challenges for real-world applications. Confidence calibration, which serves as an effective indicator of hallucination, is thus essential for enhancing the trustworthiness of LLMs. Prior work focuses mainly on short-form tasks and uses a single response-level score (macro calibration), which is insufficient for long-form outputs that may contain both accurate and inaccurate claims.
In this work, we systematically study **atomic calibration**, which evaluates factuality calibration at a fine-grained level by decomposing long responses into atomic claims. We further categorize existing confidence elicitation methods into **discriminative** and **generative** types, and propose two new confidence fusion strategies to improve calibration. Our experiments demonstrate that LLMs exhibit poorer calibration at the atomic level during long-form generation.
More importantly, atomic calibration uncovers insightful patterns in how different confidence methods align and how confidence changes over the course of generation. This sheds light on future research directions for confidence estimation in long-form generation.
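Below is a minimal sketch, not the paper's implementation, of how atomic (claim-level) calibration could be contrasted with macro (response-level) calibration using a standard binned expected calibration error (ECE). It assumes per-claim confidence scores and factuality labels are already available; the toy data, the mean-based response aggregation, and the function name `expected_calibration_error` are illustrative assumptions.

```python
# Sketch (assumed setup): compare macro vs. atomic calibration with binned ECE.
import numpy as np


def expected_calibration_error(confidences: np.ndarray,
                               correctness: np.ndarray,
                               n_bins: int = 10) -> float:
    """Binned ECE: bin-weighted gap between mean confidence and mean correctness."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correctness[mask].mean())
            ece += mask.mean() * gap
    return ece


# Hypothetical data: each long response decomposed into atomic claims, each
# with a confidence score and a binary factuality label (1 = supported).
responses = [
    {"claim_conf": [0.9, 0.8, 0.4], "claim_correct": [1, 1, 0]},
    {"claim_conf": [0.7, 0.95],     "claim_correct": [0, 1]},
]

# Atomic calibration: pool all claims and score calibration at the claim level.
atomic_conf = np.array([c for r in responses for c in r["claim_conf"]])
atomic_corr = np.array([y for r in responses for y in r["claim_correct"]])
print("atomic ECE:", expected_calibration_error(atomic_conf, atomic_corr))

# Macro calibration: one aggregated confidence per response (here: mean over
# claims), scored against the fraction of correct claims in that response.
macro_conf = np.array([np.mean(r["claim_conf"]) for r in responses])
macro_corr = np.array([np.mean(r["claim_correct"]) for r in responses])
print("macro ECE:", expected_calibration_error(macro_conf, macro_corr))
```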
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: long-form generation, confidence calibration
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Reproduction study
Languages Studied: English
Submission Number: 2241