\section{Introduction}

In recent years, researchers have developed increasingly capable large language models (LLMs) that perform well on a variety of tasks across different domains \citep{touvron2023llama, bubeck2023sparks, anil2023palm}. % jiang2023mistral
% In particular, researchers have demonstrated the capabilities of such models by showcasing their ability to generate long-form text. 
However, as the use of LLMs continues to grow, so do concerns over their tendency to hallucinate facts \citep{huang2023survey}. As a result, there is a growing need for methods that can reduce hallucinations \citep{manakul2023selfcheckgpt, zhang2023alleviating}, perform abstention \citep{yang2023alignment}, or provide correctness guarantees \citep{kumar2023conformal, mohri2024language, quach2023conformal}. Our work focuses on the last of these: broadly speaking, uncertainty quantification for long-form LLM generations.
% , which is a problem that has not previously been exhaustively explored. 

Concretely, given a set of claims produced by an LLM in response to some prompt, our goal is to provide a confidence score or uncertainty guarantee about the factual correctness of the output. We explore this problem in two settings: given a set of claims contained within a long-form response, we \textbf{(1)} quantify factuality at the individual claim level and \textbf{(2)} provide uncertainty guarantees across the whole set of claims. We approach problem \textbf{(1)} via \textit{calibration}, in which one outputs a calibrated score for each claim, while for problem \textbf{(2)} we apply \textit{conformal prediction} \citep{shafer2008tutorial}, selecting a subset of claims that, with high probability, are \textit{all} correct.
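
To fix ideas, the two goals can be sketched as follows; the notation is illustrative shorthand introduced only for this overview and is an assumption rather than a formal setup. Let $x$ denote a prompt, $C(x)$ the set of claims in the LLM's response, $y_c \in \{0, 1\}$ the correctness label of a claim $c$, and $f(c) \in [0, 1]$ a confidence score. Under one standard definition, $f$ is calibrated if
\[
  \Pr\left[\, y_c = 1 \mid f(c) = p \,\right] = p \quad \mathrm{for\ all}\ p \in [0, 1],
\]
while a conformal guarantee at a user-specified level $\alpha \in (0, 1)$ asks for a retained subset $\hat{C}(x) \subseteq C(x)$ such that
\[
  \Pr\left[\, y_c = 1 \ \mathrm{for\ all}\ c \in \hat{C}(x) \,\right] \geq 1 - \alpha .
\]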

In contrast to existing work on uncertainty guarantees for long-form generations \citep{quach2023conformal, mohri2024language}, we observe that while these guarantees may be valid under the full data distribution, they may fail to hold within individual subgroups of that distribution. For example, generations describing local politicians may be more prone to error than generations concerning national leaders. We choose biography generation as a testbed for multi-group uncertainty quantification; this choice is well motivated, given that bias within biography generation has long been studied \citep{de2019bias}. Having derived a set of subgroups using demographic information (e.g., whether an LLM output describes a man or a woman), we find that when evaluated with respect to such groupings, canonical methods for calibration and conformal prediction indeed exhibit significant biases.\footnote{Uncertainty can be epistemic or aleatoric, and sources of group bias can fall into either (or both) categories. Our work, which focuses on atomic factuality in long-form generation, falls under epistemic uncertainty.}

Having established such issues for standard uncertainty quantification approaches, we turn to the question of how far such biases can be corrected. To address this unmet need, we introduce methods for quantifying uncertainty in long-form text generation that are valid not only across the full distribution of prompts (i.e., marginally) but also across identifiable subgroups of prompts (i.e., conditionally). Invoking \textbf{(1)} \textit{multi}calibration \citep{hebert2018multicalibration} and \textbf{(2)} \textit{multivalid} conformal prediction \citep{jung2022batch}, we categorize methods into two styles: iterative ``patching'' and linear regressor-based algorithms.
% Specifically, we present these algorithms in their ``marginal'' and multi-group conditional forms. 
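
As a rough illustration of the difference between marginal and group-conditional validity (the symbols below are assumptions introduced for exposition, and the statements are simplified readings of the cited definitions), let $\mathcal{G}$ be a collection of possibly overlapping groups of prompts, let $f(c) \in [0, 1]$ be a confidence score for a claim $c$ with correctness label $y_c \in \{0, 1\}$, and let $\hat{C}(x)$ be the subset of claims retained for prompt $x$. Multicalibration roughly asks that calibration hold within every group,
\[
  \Pr\left[\, y_c = 1 \mid f(c) = p,\ x \in g \,\right] = p \quad \mathrm{for\ all}\ g \in \mathcal{G},\ p \in [0, 1],
\]
and multivalid conformal prediction asks for the analogous group-conditional coverage at level $\alpha \in (0, 1)$,
\[
  \Pr\left[\, y_c = 1 \ \mathrm{for\ all}\ c \in \hat{C}(x) \mid x \in g \,\right] \geq 1 - \alpha \quad \mathrm{for\ all}\ g \in \mathcal{G}.
\]
Taking $\mathcal{G}$ to contain only the full set of prompts recovers the marginal versions of both conditions.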

Our results demonstrate that for both problems \textbf{(1)} and \textbf{(2)}, multicalibration and multivalid conformal prediction techniques improve measures of uncertainty relative to standard (marginal) calibration and conformal prediction methods. This advantage holds \textbf{\textit{regardless}} of whether evaluation is conducted within groups or across the entire dataset. As the problems of calibration, conformal prediction, and their multi-group counterparts have not been extensively explored in the context of long-form text generation, we consider these results to form a benchmark for this setting.
% \footnote{To the best of our knowledge, among the four problems of calibration, conformal prediction, multicalibration, and multi-valid conformal prediction, only conformal prediction has been studied within the context of long-form text generation \citep{mohri2024language}.}


%  we study the more challenging setting in which these groups are intersecting, meaning that each entity can belong to multiple subgroups. 

\subsection{Related work}

\paragraph{Factuality in long-form LLM outputs.} Evaluating factuality for long-form generation \citep{min2023factscore, song2024veriscore, weilong, bayat2024factbench} is challenging: not only do generated outputs consist of many parts that must be scored individually, but scoring each part also requires prohibitively costly manual annotation. To make evaluation more tractable, \citet{min2023factscore} introduce \textsc{FActSCORE}, which converts any generation into a set of atomic facts (claims) that are then labeled as true or false. Using this evaluation metric, \citet{min2023factscore} test LLMs' abilities to generate biographies and find that their generations are riddled with errors.

% \paragraph{[Evaluation benchmark.]}The decomposition of long-form generation \citep{wanner2024closer} for biographies into atomic facts has led to a natural testbed for uncertainty quantification. 

% While not directly performing calibration, \citet{wang2024fine} demonstrate that uncertainty scoring using self-consistency approaches can be effectively used to filter out hallucinationed biographical facts. Similarly, \citet{mohri2024language} use the set of biographies proposed by \citet{min2023factscore} as one of their evaluation datasets to test their conformal prediction approach.

\paragraph{Attaching confidence scores to LLM outputs.} While a natural method for producing an uncertainty estimate is to use a model's output probabilities directly as a confidence score \citep{achiam2023gpt}, it has been shown that model probabilities are not well calibrated \citep{guo2017calibration}. As a result, many recent works propose alternative methods for generating uncertainty scores that can then be used to refine or correct LLM outputs \citep{wang2022self, xiong2023can, geng2024survey, fadeeva2023lm, vashurin2024benchmarking}.
%
We highlight that such work is complementary to ours: rather than proposing an entirely new uncertainty score function, we focus on how one can better leverage existing scores to produce uncertainty guarantees.

% \citet{xiong2023can}, for example, explore whether a model can express uncertainty directly through prompting. Meanwhile, \citet{wang2022self} draw inspiration from self-consistency, in which a score is defined over how often an answer appears over multiple responses the same prompt. 

\paragraph{Uncertainty quantification for LLMs.} 
We note that much of the prior work on multicalibration and multivalid conformal prediction is rooted in theory. Like \citet{detommaso2024multicalibration}, our work tries to bridge the gap between theoretical insights and present-day practical problems (i.e., LLM generations). However, while \citet{detommaso2024multicalibration} calibrate for correctness in question-answering, we are the first to apply multicalibration to claims decomposed from long-form text generation. Moreover, unlike \citet{detommaso2024multicalibration}, we also consider uncertainty quantification in the form of conformal prediction.

In addition, our work closely relates to \citet{mohri2024language}, who aim to provide high-probability guarantees of factuality in long-form generation. In particular, \citet{mohri2024language} frame this problem as a nested conformal prediction problem, producing subsets of claims that achieve marginally valid coverage (i.e., on average over prompts, the retained generation is correct with any user-specified probability). Our work, however, extends this problem to multivalid conformal prediction: we produce generations that are not only correct on average but are also conditionally correct across subgroups.
% , applying methods from works like \citet{gupta2022nested} to produce

Finally, concurrent work by \citet{cherian2024large} also builds on this framework, but unlike our work, they introduce a new objective in which the goal is instead to guarantee that at least some given proportion of claims is retained. By applying a method proposed in \citet{gibbs2023conformal}, the algorithm of \citet{cherian2024large} can (optionally) condition on group membership. However, their experiments include only five (non-overlapping) groups that are derived from the same feature, while our work focuses on the more challenging setting in which examples can simultaneously belong to many groups.
% output a calibrated probability while guaranteeing