\section{Methods} \label{sec:methods}

Next, we introduce the methods (and their group-conditional variants) for applying calibration and conformal prediction to language model factuality. We organize these methods into two categories: \textbf{(1)} iterative ``patching''-based algorithms and \textbf{(2)} linear regressor algorithms. As mentioned previously, prior exploration of long-form text generation has been limited. While \citet{mohri2024language} evaluate one variant---split conformal (SC)---on a small set of entities, we are not aware of prior work that has considered other uncertainty quantification methods in this setting.

\subsection{Iterative ``patching'' algorithms}

The first category of algorithms can be characterized as patching algorithms. Given a base method for calibration or conformal prediction, one iterates through the groups $g \in \mathcal{G}$ on which the method performs poorly. At each iteration, the algorithm corrects the bias (i.e., patches up the function) on just that subset of examples (i.e., those with $g(x) = 1$). Once some stopping condition is met,\footnote{In the standard formulation of iterative patching, the stopping criterion is set as a function of the number of bins so that one can prove guarantees about the algorithm (see \citet{roth2022uncertain}). In practice, we found this stopping criterion to be too conservative, and so we instead run iterative patching on the calibration and test sets concurrently and use the calibration set to determine the stopping iteration (i.e., we enforce early stopping once we can no longer make improvements on the calibration set).} the final, ``patched up'' function satisfies multi-group guarantees.

\paragraph{Calibration.}

For calibration, we consider \textit{Histogram Binning} (HB) \citep{zadrozny2001obtaining}, presented in Algorithm \ref{alg:hb}. This method takes some base scoring function $f$ and discretizes the output space into a set of $p$-th level sets $S_p(f)$, as defined in Section \ref{sec:calibration}. Given some target grid of values $p\in{[\frac{1}{m}]}$, we round $f$ to the closest value in the grid:
\begin{equation*}
    f'(x) = \argmin_{p\in{[\frac{1}{m}]}} | f(x) - p |.
\end{equation*}
%
Algorithm \ref{alg:hb} then applies a constant correction\footnote{In Algorithms \ref{alg:hb} and \ref{alg:ighb}, we assume that the true data distribution is given, and therefore that we can calculate $\Delta_{p,g}$. In practice (and in our experiments), $\Delta_{p,g}$ is estimated using a calibration set.} for each level set $S_p(f)$ in the grid, based on the calibration error of the model $f'$.
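
As a brief illustration with hypothetical numbers, and assuming that the grid consists of the multiples of $\frac{1}{m}$ in $[0,1]$, take $m = 10$ and a claim $x$ with $f(x) = 0.73$. Rounding gives
\begin{equation*}
    f'(x) = \argmin_{p \in \{0,\, 0.1,\, \ldots,\, 1\}} | 0.73 - p | = 0.7.
\end{equation*}
%
If, say, $60\%$ of the calibration examples in the level set with $f'(x) = 0.7$ are correct, then a constant correction that removes the average bias on this level set shifts the prediction by $0.6 - 0.7 = -0.1$, so that every example in the level set receives the score $0.6$.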

In Algorithm \ref{alg:ighb}, we present the multi-group version of histogram binning, known as \textit{Iterative Grouped Histogram Binning} (IGHB) \citep{hebert2018multicalibration}. In this algorithm, we instead apply a constant correction conditioned on $S_{p, g}$ (i.e., both the level set and group membership). At each step $t$, IGHB identifies $S_{p, g}$ for which the calibration error (weighted by the group size) is highest and then corrects it for this level set \textit{and} group. The algorithm then continues until some stopping condition is met, iteratively patching $f'$ for various groups $g \in \mathcal{G}$.
% \footnote{Note that this method is iterative because groups are often intersecting so that any example $X$ may belong to many different groups. In the less interesting case where groups are not intersecting, one can patch all groups and level sets in one shot (in other words, apply histogram binning (Algorithm \ref{alg:hb}), but replacing $S_p(f)$ with $S_{p,g}(f)$.}

\input{docs/algos/hb}
\input{docs/algos/ighb}

\paragraph{Conformal prediction.} We first present the \textit{Split Conformal} (SC) method \citep{shafer2008tutorial, gupta2022nested}. In particular, we consider the standard approach in which one constructs a sequence of nested sets, where each output set contains some subset $\mathcal{F}_t(L(X))$ of the claims generated by the LLM.

Following \citet{mohri2024language}, we define these nested sets as threshold sets indexed by $t \in \mathcal{T}$, where each set $\mathcal{F}_t(L(X))$ contains all individual claims $\{ x \in L(X) \mid f(x) > t \}$ for some scoring function $f$. More formally, $\{\mathcal{F}_t(L(X))\}_{t \in \mathcal{T}}$ satisfies the nested sequence property if, for all $t, t' \in \mathcal{T}$ with $t \le t'$, we have $\mathcal{F}_t(L(X)) \subseteq \mathcal{F}_{t'}(L(X))$.
% \footnote{The same scoring function $f$ that one may wish to (multi)calibrate can also be used for split conformal.}

To construct these threshold sets, we have that
%
\begin{equation*}
    r(X,Y) = \inf \{ t \in \mathcal{T} \mid Y \in \mathcal{F}_t(L(X)) \}
\end{equation*}
%
where $r$ defines the minimum safe threshold such that $Y \in \mathcal{F}_t(L(X))$ for all $t > r(X,Y)$. Practically speaking, given a set of uncertainty scores $f(x)$ for the claims $x \in L(X)$, $r(X,Y)$ is the smallest threshold for which the retained claims are all correct: the set $\mathcal{F}_t(L(X)) = \{ x \in L(X) \mid f(x) > t \}$ is entirely true if and only if $t \ge r(X,Y)$.
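
For intuition, consider a hypothetical response with four claims whose scores and correctness labels are
\begin{equation*}
    f(x_1) = 0.9 \ (\mathrm{true}), \quad f(x_2) = 0.7 \ (\mathrm{true}), \quad f(x_3) = 0.6 \ (\mathrm{false}), \quad f(x_4) = 0.4 \ (\mathrm{true}).
\end{equation*}
%
With threshold sets defined by the strict inequality $f(x) > t$, any $t \ge 0.6$ removes the false claim $x_3$, whereas any $t < 0.6$ retains it. Hence $r(X,Y) = 0.6$ and $\mathcal{F}_{0.6}(L(X)) = \{x_1, x_2\}$. Note that the true claim $x_4$ is also removed, since thresholding cannot keep a claim that scores below a false one.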

Given some calibration set $\hat{D}$ of size $n$ and some target error rate $\alpha$ (or target coverage $1-\alpha$), split conformal simply outputs the set $\mathcal{F}_{q_\alpha}(L(X))$ for any $X$, where $q_\alpha$ is the $\frac{\lceil (n+1)(1-\alpha) \rceil}{n}$-th quantile of the scores $\{r(X_i, Y_i)\}_{i=1}^n$ for $(X_i,Y_i) \in \hat{D}$.
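
As a worked example with hypothetical values, let $n = 99$ and $\alpha = 0.1$. The quantile level is then
\begin{equation*}
    \frac{\lceil (n+1)(1-\alpha) \rceil}{n} = \frac{\lceil 100 \cdot 0.9 \rceil}{99} = \frac{90}{99} \approx 0.909,
\end{equation*}
%
so $q_\alpha$ is the $90$th smallest of the $99$ calibration scores $r(X_i, Y_i)$. The level is slightly larger than the nominal $1 - \alpha = 0.9$; this is the usual finite-sample correction, and its effect vanishes as $n$ grows.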

In Algorithm \ref{alg:mvsc}, we present the \textit{multivalid split conformal} (MVSC) prediction technique, which closely resembles methods originally proposed in \citet{jung2022batch}. Similar to IGHB, we start with some base threshold (i.e., the threshold $q_\alpha$ obtained from split conformal). Then, at each iteration $t$, we find the group $g_t$ that has the worst squared coverage error $\Delta_{t,g}$, weighted by the size of the group $P(g_t(X) = 1)$. We then simply ``patch'' the thresholds for the examples $\{ (X,Y) \mid g_t(X) = 1\}$, again using the $\frac{\lceil (n+1)(1-\alpha) \rceil}{n}$-th quantile of the scores for the $(X,Y)$ belonging to group $g_t$. As in IGHB, we continue patching the set of thresholds until some stopping criterion is met.
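
To illustrate the role of the group-size weighting, suppose (purely as an assumption for this example; the exact definition of $\Delta_{t,g}$ is the one given in Algorithm \ref{alg:mvsc}) that the selection criterion is the group probability times the squared gap between the group's empirical coverage and the target $1 - \alpha = 0.9$. For two hypothetical groups $g_1$ and $g_2$ with
\begin{equation*}
    P(g_1(X) = 1) = 0.5, \ \mathrm{coverage} = 0.85 \quad \mathrm{and} \quad P(g_2(X) = 1) = 0.1, \ \mathrm{coverage} = 0.80,
\end{equation*}
%
the weighted errors are $0.5 \cdot (0.05)^2 = 0.00125$ and $0.1 \cdot (0.10)^2 = 0.001$, respectively. Under this criterion, $g_1$ is patched first even though $g_2$ has the larger unweighted coverage gap.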

\input{docs/algos/sc}


\subsection{Linear regressor algorithms}

Next, we consider algorithms that instead solve an optimization problem for the purpose of calibration and conformal prediction. In these cases, one can naturally make them multi-group/valid by including group membership (i.e., the indicators $g(X)$ for all $g \in \mathcal{G}$) in the optimization problem itself. Formally, we describe these linear regression based methods in Algorithms \ref{alg:linear_regressor} and \ref{alg:group_regressor}. Presented in this way, the difference between the methods for calibration and conformal prediction reduces to a choice of loss function $L$. Again, we assume that one has access to some calibration set on which one solves the optimization problem.

\paragraph{Calibration.} For calibration, one can choose $L$ to be binary cross-entropy loss. In doing so, Algorithm \ref{alg:linear_regressor} then describes \textit{Platt Scaling} (PS) \citep{platt1999probabilistic}, which can be described as fitting a logistic regression model to some set of model outputs to obtain calibrated probability scores.\footnote{A calibration method related to Platt scaling (PS) is temperature scaling (TS) \citep{guo2017calibration}, which was originally introduced for calibrating neural networks for multiclass classification and has been incorporated in work on calibrating NLP models \citep{sicilia2024deal}. We note, however, that in the binary classification setting (e.g., our setting, where we identify whether an output is correct or not), TS is mathematically equivalent to PS when there is no bias term and the weight takes on the form $\frac{1}{\tau}$, where $\tau$ is the temperature learned in TS.}
%
Algorithm \ref{alg:group_regressor} describes the multi-calibrated version of Platt Scaling. While not explicitly derived in their work, this multicalibration formulation can be traced back to \citet{gopalan2022low}, who establish a hierarchy of notions for multicalibration and analyze multicalibration on functions trained with linear loss. Going forward, we refer to this method as \textit{Group Conditional Unbiased Logistic Regression} (GCULR).
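
As a sketch in illustrative notation (the symbols $w$, $b$, and $\beta_g$ are introduced here only for exposition, $\sigma$ denotes the logistic function, and the exact parameterization is the one given in Algorithms \ref{alg:linear_regressor} and \ref{alg:group_regressor}), the marginal and group-conditional predictors can be thought of as
\begin{equation*}
    \hat{p}(x) = \sigma\big( w f(x) + b \big)
    \quad \mathrm{and} \quad
    \hat{p}_{\mathcal{G}}(x) = \sigma\Big( w f(x) + b + \sum_{g \in \mathcal{G}} \beta_g \, g(x) \Big),
\end{equation*}
%
with all parameters fit by minimizing binary cross-entropy on the calibration set. In this notation, temperature scaling corresponds to the special case $\hat{p}(x) = \sigma( f(x) / \tau )$, i.e., $w = \frac{1}{\tau}$ and $b = 0$.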

\paragraph{Conformal prediction.} For conformal prediction, we instead choose $L$ to be pinball loss. We refer to the non-group version of this method (Algorithm \ref{alg:linear_regressor}) as \textit{Conformalized Quantile Regression} (CQR) \citep{romano2019conformalized}, in which, given some target coverage $1 - \alpha$, we fit a linear quantile regression model that minimizes pinball loss.
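
For reference, the standard pinball loss at quantile level $1 - \alpha$, written here for a predicted threshold $\hat{r}$ and a target $r$ (notation introduced for this illustration), is
\begin{equation*}
    L_{1-\alpha}(\hat{r}, r) = \max \big\{ (1-\alpha)(r - \hat{r}), \; \alpha (\hat{r} - r) \big\}.
\end{equation*}
%
For example, with $\alpha = 0.1$ and $r = 0.6$, predicting $\hat{r} = 0.5$ incurs a loss of $0.9 \cdot 0.1 = 0.09$, whereas predicting $\hat{r} = 0.7$ incurs only $0.1 \cdot 0.1 = 0.01$. Under-predicting the threshold is thus penalized nine times more heavily than over-predicting it by the same amount, which is why the minimizer targets the $(1-\alpha)$-quantile rather than the median.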

In our conformal prediction setting, as described in Section \ref{sec:conformal}, $X$ is an entire biography, or set of independent claims. Thus, to adapt quantile regression to long-form generation, we propose setting $f(X)$ to be a vector of uncertainty scores for each claim $x \in X$. As in split conformal, the target is then the minimum threshold $r(X,Y)$ for which all claims above it are correct. In the multivalid case, we then add group features $g(X)$ to the optimization problem. A version of Algorithm \ref{alg:group_regressor} was first presented by \citet{jung2022batch}, and going forward, we will refer to this method as \textit{Group Conditional Conformalized Quantile Regression} (GCCQR).

We note that in our experiments, each biography generated by the LLM may have a different number of claims, a setting that prior work on conformal quantile regression does not account for. Consequently, we propose using interpolation to (un)squeeze the set of scores to a vector $f(X)$ of fixed size ($K=25$ in our experiments). While \citet{mohri2024language} only show that split conformal can be applied to this type of setting, our experiments demonstrate that quantile regression methods achieve similar performance for marginal (CQR) and multigroup (GCCQR) methods (Section \ref{sec:results}).
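
To illustrate the (un)squeezing step on a small scale, suppose (as an assumption for this example only) that the scores are placed at evenly spaced positions on $[0,1]$ and resampled by linear interpolation. A hypothetical biography with three claim scores $(0.2, 0.5, 0.9)$, located at positions $(0, 0.5, 1)$, is then stretched to a vector of size $K = 5$ by evaluating the interpolant at positions $(0, 0.25, 0.5, 0.75, 1)$:
\begin{equation*}
    (0.2, \; 0.5, \; 0.9) \; \longmapsto \; (0.2, \; 0.35, \; 0.5, \; 0.7, \; 0.9).
\end{equation*}
%
A biography with more than $K$ claims would be squeezed in the same way, by evaluating the interpolant at fewer positions than there are scores.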

\input{docs/algos/reg}