\begin{abstract}

% While large language models (LLMs) are rapidly moving towards consumer-facing applications, they are often still prone to factual errors and hallucinations.
% %
% To this end, uncertainty quantification for factual correctness has garnered significant interest for many whose goal is to quantify such risks in long-form natural language generation.
%
While prior work has shown how uncertainty quantification can be applied to large language model (LLM) outputs, the question of whether the resulting uncertainty guarantees still hold within subgroups of the data remains open.
%
In our work, given some long-form text generated by an LLM, we study uncertainty both at the level of individual claims contained within the output (via \textit{calibration}) and across the entire output itself (via \textit{conformal prediction}).
%
Using biography generation as a testbed for this study, we derive a set of (demographic) attributes (e.g., whether some text describes a man or a woman) for each generation to form such ``subgroups'' of data.
%
We find that although canonical methods for both types of uncertainty quantification perform well when measured over the entire dataset, these guarantees break down within particular subgroups.
%
Having established this issue, we apply group-conditional methods for uncertainty quantification, namely \textit{multicalibration} and \textit{multivalid} conformal prediction, and find that across a variety of approaches, additional subgroup information consistently improves calibration and conformal prediction within subgroups (while crucially retaining guarantees across the entire dataset).
%
As the problems of calibration, conformal prediction, and their multi-group counterparts have not been extensively explored in the context of long-form text generation, we consider these results to form a benchmark for this setting.

    
\end{abstract}

% previous versions
\iffalse
%v1
While large language models (LLMs) are rapidly moving towards consumer-facing applications, they are often still prone to factual errors and hallucinations. In order to reduce the potential harms that may come from these errors, it is important for users to know to what extent they can trust an LLM when it makes a factual claim. To this end, we study the problem of uncertainty quantification of factual correctness in long-form natural language generation. Given some output from a large language model, we study uncertainty both at the level of individual claims contained within the output (via calibration) and uncertainty across the entire output itself (via conformal prediction). Moreover, we invoke multicalibration and multivalid conformal prediction to ensure that such uncertainty guarantees are valid both marginally and across distinct groups of prompts. Using the task of biography generation, we demonstrate empirically that having access to and making use of additional group attributes for each prompt improves both overall and group-wise performance. As the problems of calibration, conformal prediction, and their multi-group counterparts have not been extensively explored previously in the context of long-form text generation, we consider these empirical results to form a benchmark for this setting.
\fi
