On the Scoring Functions for RAG-based Conformal Factuality

Published: 01 Jul 2025, Last Modified: 07 Jul 2025 · ICML 2025 R2-FM Workshop Poster · CC BY 4.0
Keywords: conformal prediction, conformal factuality, scoring function, LLM, RAG
TL;DR: The paper compares scoring methods for improving LLM factuality in RAG systems.
Abstract: Large language models (LLMs), despite their effectiveness, often produce hallucinated or non-factual outputs. To mitigate this, conformal factuality frameworks utilize scoring functions to filter model-generated claims and provide statistical factuality guarantees. This study systematically investigates various scoring functions in a retrieval-augmented generation (RAG) context, where a reference text and query inform the LLM’s responses. We evaluate three distinct scoring methods—non-reference model confidence, reference model confidence, and entailment scores—using empirical factuality, power, and false positive rates as metrics. Additionally, we assess the robustness of these scoring functions when the assumption of data exchangeability is mildly violated by incorporating deliberately hallucinated claims. Our findings reveal that reference model confidence scores generally outperform other methods by achieving higher power and improved robustness. However, entailment-based scoring shows the lowest false positive rates under conditions of induced hallucinations. This work highlights the critical importance of scoring function selection to enhance factuality and robustness in RAG-based conformal frameworks.
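Below is a minimal sketch of the claim-filtering step that conformal factuality frameworks of this kind rely on, assuming per-claim scores from one of the scoring functions and a calibration set of scored claims with factuality labels. The quantile-based threshold follows the standard split conformal recipe; the function and variable names are illustrative and not taken from the paper.

```python
import numpy as np

def conformal_threshold(calib_scores, calib_factual, alpha=0.1):
    """Pick a score threshold from calibration data so that, for a new
    response, keeping only claims scoring above it retains factuality
    with probability >= 1 - alpha (split conformal recipe).

    calib_scores:  list of np.ndarray, per-response claim scores
    calib_factual: list of np.ndarray (bool), True = claim is factual
    """
    # Nonconformity per calibration response: the highest score assigned
    # to any non-factual claim (filtering above it removes all errors).
    nonconformity = []
    for scores, factual in zip(calib_scores, calib_factual):
        bad = scores[~factual]
        nonconformity.append(bad.max() if bad.size else -np.inf)

    n = len(nonconformity)
    level = np.ceil((n + 1) * (1 - alpha)) / n
    if level > 1:
        return np.inf  # too few calibration responses: keep no claims
    # Finite-sample-corrected conformal quantile.
    return np.quantile(nonconformity, level, method="higher")

def filter_claims(claims, scores, threshold):
    """Keep only claims whose score exceeds the calibrated threshold."""
    return [c for c, s in zip(claims, scores) if s > threshold]
```

The scoring function (non-reference model confidence, reference model confidence, or entailment) only changes how `scores` is produced; the calibration and filtering steps above stay the same, which is what makes the comparison in the paper possible.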
Submission Number: 130