Understanding Conformal Factuality for RAG-based LLMs: Novel Metrics and Systematic Insights

15 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: LLMs, hallucinations, factuality, conformal prediction, retrieval-augmented generation (RAG), evaluation metrics, robustness
TL;DR: This paper integrates conformal factuality filtering with retrieval-augmented generation, introduces new factuality metrics, and shows through large-scale experiments that standard measures miss key trade-offs between correctness and informativeness.
Abstract: Large language models (LLMs) are powerful generative models, but they often produce responses that are plausible yet not grounded in factual reality, commonly referred to as “hallucinations”. This poses a challenge for applications that require factually correct answers. Two promising approaches have emerged in the literature to mitigate this issue: (i) conformal factuality filtering, a framework that provides statistical guarantees on the factual accuracy of claims in the final output but cannot prevent hallucinations during response generation, and (ii) retrieval-augmented generation (RAG), which grounds generation in trusted knowledge bases to reduce hallucinations but offers no statistical guarantees. In this work, we unite these two approaches by integrating a conformal factuality framework with RAG and systematically study their performance to understand their strengths and limitations. We investigate the role of key components: the reference used for generation and scoring functions, sensitivity to calibration data, model capacity, reasoning, and robustness to distractors. We propose three new metrics, \emph{non-empty rate}, \emph{non-vacuous empirical factuality}, and \emph{sufficient correctness}, to address limitations of standard factuality measures, which fail to meaningfully capture the usefulness of the output. Our experiments are comprehensive, spanning three datasets (FActScore, MATH, and Natural Questions) and multiple model families and sizes. Our results show the importance of scoring-function design and highlight trade-offs between correctness and informativeness that standard metrics fail to capture. Together, our findings provide practically useful insights and underscore the need to rethink how LLM factuality is evaluated.
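To make the intent of the three proposed metrics concrete, the sketch below computes plausible versions of them over conformally filtered outputs. The exact definitions are those given in the paper; the function names, the representation of an output as a list of (claim, is_correct) pairs, and the `min_correct` sufficiency threshold are illustrative assumptions, not the paper's specification.

```python
# Hypothetical sketch of the three proposed metrics. Definitions here are
# assumptions for illustration; see the paper for the formal versions.

def non_empty_rate(outputs):
    """Fraction of filtered outputs that retain at least one claim."""
    return sum(1 for claims in outputs if claims) / len(outputs)

def non_vacuous_empirical_factuality(outputs):
    """Factuality computed only over non-empty outputs, so an empty
    (trivially 'correct') answer cannot inflate the score."""
    non_empty = [claims for claims in outputs if claims]
    if not non_empty:
        return 0.0
    return sum(all(ok for _, ok in claims) for claims in non_empty) / len(non_empty)

def sufficient_correctness(outputs, min_correct=1):
    """Fraction of outputs retaining at least `min_correct` correct claims
    (a stand-in notion of sufficiency; the threshold is an assumption)."""
    return sum(
        1 for claims in outputs
        if sum(ok for _, ok in claims) >= min_correct
    ) / len(outputs)

# Each output is the post-filtering claim set: (claim_text, is_correct) pairs.
outputs = [
    [("claim A", True), ("claim B", True)],
    [],                                       # fully filtered, empty answer
    [("claim C", True), ("claim D", False)],
]
print(non_empty_rate(outputs))                    # 2/3: one output is empty
print(non_vacuous_empirical_factuality(outputs))  # 1/2: of the two non-empty
print(sufficient_correctness(outputs))            # 2/3: two outputs have a correct claim
```

The contrast between the last two numbers is the point: an aggressive filter can look perfectly factual under standard measures while returning many empty or uninformative answers, which these metrics expose.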
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 6253