\section{Preliminaries}

% Next, we present calibration and conformal prediction, as well as their multi-group versions, multicalibration and multivalid conformal prediction.

\subsection{Calibration}\label{sec:calibration}

We begin by defining calibration in the context of factuality in open-ended text generation. Suppose we are given $(X,Y) \sim \mathcal{D}$, where $X \in \mathcal{X}$ denotes a claim output by an LLM and $Y \in \{0, 1\}$ indicates whether the claim is correct ($Y=1$) or not ($Y=0$). Suppose further that there exists an uncertainty score function $f: \mathcal{X} \to [0, 1]$ that measures confidence in the correctness of an input $X$ (with higher values denoting higher confidence). One goal when designing such a score function $f$ is to have that
%
\begin{equation}\label{eq:perfect}
    P_{\mathcal{D}} (Y=1 \mid f(X) = p) = p, \quad \forall p \in [0, 1].
\end{equation}
%
In other words, the probability that some LLM output is correct is given exactly by $f$.

Calibration relaxes this to a simpler, more tractable condition. Instead of requiring a guarantee at every possible value of $f$, it requires a guarantee over coarser level sets $S_p(f)$:
%
\begin{definition}\label{def:calibration}
(Calibration) A function $f$ is calibrated w.r.t.\ $\mathcal{D}$ if
\begin{equation*}
    \Delta_p(f) = 0, \quad \forall p \in [0, 1],
\end{equation*}
where $\Delta_p(f)$ is the bias of $f$ on the $p$-th level set $S_p(f) = \{x \in \mathcal{X} : f(x) = p\}$:
\begin{equation*}
    \Delta_p(f) = \mathbb{E}_{\mathcal{D}}[Y - f(X) \mid X \in S_p(f)].
\end{equation*}
\end{definition}

Defining level sets is akin to dividing the output space of $f$ (i.e., $[0,1]$) into buckets. For example, one could round $f(X)$ to the nearest value in some predefined set of probabilities (e.g., $\{0, 0.5, 1.0\}$). This definition of calibration is a desirable guarantee because it is a necessary condition for Equation~\eqref{eq:perfect}: any $f$ that satisfies \eqref{eq:perfect} must, at the very least, also be calibrated. To evaluate calibration, we can consider the average squared calibration error (ASCE) of $f$:
%
\begin{equation}\label{def:asce}
    \textrm{ASCE}(f) = \mathbb{E}_P [\Delta^2_{P}(f)]
\end{equation}
%
Here the expectation is taken over the level $P = f(X)$ induced by $X \sim \mathcal{D}$. The ASCE therefore averages the squared bias across all level sets and is zero when $f$ is calibrated.
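As a purely illustrative example (all numbers here are hypothetical and not drawn from any experiment), suppose $f$ takes only the values $0.2$ and $0.8$, each with probability $1/2$, and that the true fractions of correct claims on the two level sets are $0.3$ and $0.8$, respectively. Reading the expectation in \eqref{def:asce} as weighting each level set by its probability under $\mathcal{D}$, we obtain
\begin{equation*}
    \Delta_{0.2}(f) = 0.3 - 0.2 = 0.1, \qquad \Delta_{0.8}(f) = 0.8 - 0.8 = 0, \qquad \textrm{ASCE}(f) = \tfrac{1}{2}(0.1)^2 + \tfrac{1}{2}(0)^2 = 0.005.
\end{equation*}
In this sketch, $f$ underestimates correctness on its low-confidence level set, so the ASCE is small but nonzero.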

\paragraph{Multicalibration.} %\label{sec:multicalibration}

While calibration already provides an important and useful guarantee, it can be insufficient in many real-world scenarios. For example, when generating information about people, one may desire that $f$ be calibrated not only across all people but also within subpopulations defined by demographic attributes such as \textit{sex or gender}. Otherwise, certain subgroups may still suffer from very high miscalibration even when the score function is perfectly calibrated across $\mathcal{D}$. Ideally, one would have guarantees conditional on as many subgroups in $\mathcal{X}$ as possible, both from the perspective of machine learning fairness and to improve the likelihood of correctness in general.

Multicalibration \citep{hebert2018multicalibration} was developed to provide calibration guarantees across overlapping subgroups (i.e., a sample can belong to many groups). Let $g: \mathcal{X} \to \{0, 1\}$ be a group function that evaluates to $1$ if $X$ belongs to the corresponding group and to $0$ otherwise. We study the setting in which there exists a set of groups $\mathcal{G}$ that corresponds to our data domain $\mathcal{D}$. The groups may be disjoint, but multicalibration is trivial in that case: one can simply split the dataset into disjoint subsets and calibrate each individually. Consequently, prior work typically considers the more interesting case in which $\mathcal{G}$ comprises many intersecting groups.

Given a group function $g$, we define group average squared calibration error (gASCE) as:
%
\begin{equation}\label{def:gasce}
    \textrm{gASCE}(f, g) = \mathbb{E}_P [\Delta^2_{P,g}(f) \mid g(X) = 1]
\end{equation}
%
where 
\begin{equation*}
    \Delta_{p,g}(f) = \mathbb{E}_{\mathcal{D}}[Y - f(X) \mid S_{p, g}(f)]
\end{equation*}
for $S_{p, g}(f) = \{x \in \mathcal{X} : f(x) = p,\ g(x) = 1\}$.
%
In other words, gASCE conditions on both level sets and group membership. Finally, we have:
%
\begin{definition}\label{def:multicalibration}
(Multicalibration) A function $f$ is $\alpha$-multicalibrated w.r.t.\ $\mathcal{D}$ and a set of groups $\mathcal{G}$ if and only if
\begin{equation*}
    \textrm{gASCE}(f, g) < \frac{\alpha}{P_{\mathcal{D}}(g(X) = 1)}, \quad \forall g \in \mathcal{G}.
\end{equation*}
\end{definition}
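To see why Definition~\ref{def:multicalibration} is strictly more demanding than Definition~\ref{def:calibration}, consider the following hypothetical example (the numbers are chosen for illustration only). Let $f(x) = 0.5$ for all $x$, and let $g_1, g_2$ be two groups that partition $\mathcal{X}$ with $P_{\mathcal{D}}(g_1(X) = 1) = P_{\mathcal{D}}(g_2(X) = 1) = 1/2$, whose claims are correct with probability $0.3$ and $0.7$, respectively. Then
\begin{equation*}
    \Delta_{0.5}(f) = \tfrac{1}{2}(0.3 - 0.5) + \tfrac{1}{2}(0.7 - 0.5) = 0, \qquad \Delta_{0.5, g_1}(f) = -0.2, \qquad \Delta_{0.5, g_2}(f) = 0.2,
\end{equation*}
so $\textrm{ASCE}(f) = 0$ while $\textrm{gASCE}(f, g_1) = \textrm{gASCE}(f, g_2) = 0.04$. With $\mathcal{G} = \{g_1, g_2\}$, $f$ is $\alpha$-multicalibrated exactly when $0.04 < \alpha / 0.5$, i.e., when $\alpha > 0.02$. Disjoint groups are used here only to keep the arithmetic short.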

\subsection{Conformal Prediction}\label{sec:conformal}

In conformal prediction, the general goal is to produce a confidence set $\mathcal{T}(X)$ for an example $X$ such that this set marginally \textit{covers} the true label $Y$ with target probability $1-\alpha$:
%
\begin{equation}\label{eq:conformal}
    P_{\mathcal{D}} (Y \in \mathcal{T}(X)) = 1 - \alpha
\end{equation}

The second part of our work follows the problem statement outlined in \citet{mohri2024language}. Unlike in calibration, where each claim contained in a long-form generation is treated individually, \citet{mohri2024language} instead define their problem in terms of pairs $(X, Y)$, where $X$ is an input prompt and $L(X) = Y \in \mathcal{Y}$ is the long-form generation output by an LLM $L$. Because $Y$ may or may not be supported by some reference ground truth $Y^*$,\footnote{In the case of FActScore \citep{min2023factscore}, ``is $Y$ supported by Wikipedia?''}
%
\citet{mohri2024language} define factuality in terms of the entailment relation $Y^* \implies Y$. Furthermore, they rewrite this relation as $Y^* \in E(Y) = \{ Y' \in \mathcal{Y}: Y' \implies Y \}$. In other words, this equivalent set notation states that the reference ground truth $Y^*$ (e.g., a Wikipedia article in \citet{min2023factscore}) is contained in the set of possible texts $Y'$ that support all claims made in the LLM output $Y$.

Given this notation, the goal is to find an uncertainty set $\mathcal{T}(L(X))$ such that $P_{\mathcal{D}} (Y \in \mathcal{T}(L(X))) = 1 - \alpha$. In the context of long-form text generation, this goal translates to taking the original LLM output $L(X)$ as input and producing a subset of claims $\mathcal{T}(L(X))$ such that, with probability $1 - \alpha$, all remaining claims are factually correct.

To measure such guarantees empirically, one can use the \textit{coverage error} of $\mathcal{T}$ w.r.t.\ the target error rate $\alpha$:
%
\begin{equation}
    | P_{\mathcal{D}} (Y \notin \mathcal{T}(X)) - \alpha |
\end{equation}
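In practice this quantity must be estimated from data. One natural plug-in estimate, stated here as an assumption rather than as the estimator used in any particular experiment, replaces the probability with an empirical frequency over $n$ held-out pairs $(X_i, Y_i)$:
\begin{equation*}
    \Big| \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}[Y_i \notin \mathcal{T}(X_i)] - \alpha \Big|.
\end{equation*}
For instance, with the hypothetical values $\alpha = 0.1$, $n = 200$, and $28$ sets that fail to cover, the estimate is $|28/200 - 0.1| = 0.04$.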
%
% the absolute difference between the target coverage $1-\alpha$ and the actual proportion of uncertainty sets $\mathcal{T}(L(X))$ that are entirely factual.

\paragraph{Multivalid Conformal Prediction.}

As with calibration, one may also desire group-conditional coverage guarantees for intersecting groups. Known as \textit{multivalid conformal prediction} \citep{jung2022batch}, these guarantees are stronger than marginal conformal guarantees because they also hold when conditioned on group membership. Using group functions $g$ as defined in Section~\ref{sec:calibration}, full multivalid coverage can be written as follows: given some set of groups $\mathcal{G}$, we have that
%
\begin{equation}\label{eq:multi_conformal}
    P_{\mathcal{D}} (Y \in \mathcal{T}(X) \mid g(X) = 1) = 1 - \alpha
\end{equation}
%
for all group functions $g \in \mathcal{G}$. Thus, target coverage guarantees $1-\alpha$ must hold both marginally and within all subgroups.
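As a hypothetical illustration of why marginal coverage does not imply \eqref{eq:multi_conformal}, take $\alpha = 0.1$ and a group $g$ with $P_{\mathcal{D}}(g(X) = 1) = 1/2$ whose conditional coverage is $0.95$, while coverage on its complement is $0.85$ (all values are assumed for illustration). Then
\begin{equation*}
    P_{\mathcal{D}} (Y \in \mathcal{T}(X)) = \tfrac{1}{2}(0.95) + \tfrac{1}{2}(0.85) = 0.9 = 1 - \alpha,
\end{equation*}
so \eqref{eq:conformal} holds. Yet if both $g$ and its complement belong to $\mathcal{G}$, \eqref{eq:multi_conformal} is violated by $0.05$ in each group: one group is over-covered and the other is under-covered.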
