\section{Empirical Results}\label{sec:results}

% \input{tables/multicalibration_asce_nq}
% \input{tables/multicalibration_brier_nq}
\input{figures/conformal_nq}


To assess the efficacy of the methods introduced in Section~\ref{sec:methods}, we present results for the task of biography generation using outputs from Llama 2 7B Chat \citep{touvron2023llama} and Mistral 7B Instruct v0.2 \citep{jiang2023mistral}. We randomly split the entities 80-20 into calibration and test sets and average results over 10 such random splits.
% Table \ref{tab:dataset_stats} reports the number of entities for which each model generates biographies and the number of claims contained in each generated biography.

We stress that the primary goal (and novelty) of our work is to evaluate, with respect to metrics for uncertainty quantification, (1) the efficacy and failures of marginal methods and (2) the extent to which multi-group methods improve over them. In supplementary results in Appendix \ref{appx:analysis}, we include additional analysis comparing how marginal and group-conditional methods behave. Tables \ref{tab:examples_llama2_retain_more}, \ref{tab:examples_mistral_retain_more}, \ref{tab:examples_llama2_nonempty}, and \ref{tab:examples_mistral_nonempty} of the appendix provide examples of the types of outputs produced by both marginal and multivalid conformal methods.

\paragraph{Calibration.} 

% To evaluate fact-level uncertainty, we consider both ASCE and Brier score,\footnote{While not a direct measure of (multi)calibration like ASCE, the Brier score is still useful in certain settings for quantifying the efficacy of the algorithms we consider.}
% %
% which is the mean squared error between the uncertainty score function $f(X)$ and the true label $Y$. As discussed previously, we generate biographies for entities from \textsc{Bio-NQ} and apply the techniques discussed in Section \ref{sec:methods}. 


In Table \ref{tab:multicalibration_asce_nq}, we report ASCE, max gASCE, and mean gASCE, comparing each calibration method (HB, PS) against its multicalibration counterpart (IGHB, GCULR) across various base scoring functions. We find that marginal calibration methods (HB, PS) are able to correct the uncalibrated uncertainty scores, significantly decreasing ASCE. However, when examining max and mean gASCE, we find that these methods do not provide strong guarantees when uncertainty is evaluated within individual subgroups. The discrepancy is particularly large in cases where the marginal method performs well w.r.t. marginal ASCE. For example, when applying PS to self-consistency scores and comparing ASCE to max gASCE, we observe that there exists some subgroup on which the calibration error is approximately \textbf{263x} and \textbf{198x} (for Llama and Mistral output, respectively) worse than on the dataset as a whole.
 
In contrast, the multicalibration variants of both the patching (IGHB) and linear regression (GCULR) techniques significantly outperform HB and PS in terms of max and mean gASCE across all experimental settings. Our results provide strong evidence that, regardless of the model, base scoring function, or algorithm type, incorporating information about the subgroups in some meaningful way substantially corrects the biases that marginal methods exhibit.

Even more surprising is that when considering just (marginal) ASCE across the entire dataset, incorporating group features improves performance as well. Aside from the experiments calibrating verbalized confidence scores on Llama 2 7B Chat generations, where HB and PS perform particularly well, the multicalibration variant outperforms the marginal method every time, with GCULR being the best method in almost all cases. While improving marginal calibration is not the primary focus of our work, these results suggest that even if one does not specifically require parity for specific subgroups, collecting additional group features and applying multicalibration (as opposed to vanilla calibration) can still be extremely beneficial for generating better-calibrated uncertainty scores. 

Finally, to evaluate fact-level uncertainty more holistically, we also consider the Brier score, which is the mean squared error between the uncertainty score function $f(X)$ and the true label $Y$. While not a direct measure of (multi)calibration like ASCE, the Brier score is still useful in certain settings for assessing the efficacy of the algorithms we consider, since it reflects desirable properties that calibration error does not capture \citep{brocker2009reliability, liu2025calibrating}. In Table \ref{tab:multicalibration_brier_nq}, we report the Brier score both marginally and across groups. Similar to our analysis of ASCE, we again see that standard calibration methods exhibit failures when evaluated at the group level. However, IGHB and GCULR outperform HB and PS, respectively, across all metrics.
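
For concreteness, the marginal and group-wise Brier scores can be written as below. This is only a sketch in illustrative notation that may differ from Section~\ref{sec:methods}: it assumes $n$ test claims with scores $f(x_i)$ and binary labels $y_i \in \{0,1\}$, and writes $S_g \subseteq \{1, \dots, n\}$ for the (assumed nonempty) set of claims belonging to subgroup $g$.
\[
\mathrm{BS}(f) = \frac{1}{n} \sum_{i=1}^{n} \bigl(f(x_i) - y_i\bigr)^2,
\qquad
\mathrm{BS}_g(f) = \frac{1}{|S_g|} \sum_{i \in S_g} \bigl(f(x_i) - y_i\bigr)^2 .
\]
Under this reading, group-level results would summarize $\mathrm{BS}_g(f)$ over subgroups (for instance, via its maximum or mean), analogous to max and mean gASCE.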

\paragraph{Conformal prediction.} For the problem of uncertainty at the biography level, we apply the vanilla conformal prediction methods SC and CQR and their multivalid counterparts, MVSC and GCCQR.\footnote{To compare methods qualitatively, we provide illustrative example outputs in Appendix \ref{appx:analysis}, Tables \ref{tab:examples_llama2_retain_more}, \ref{tab:examples_mistral_retain_more}, \ref{tab:examples_llama2_nonempty}, and \ref{tab:examples_mistral_nonempty}.}
%
We choose target coverages between $0.5$ and $0.9$, evaluating on biographies generated by Llama 2 7B Chat and Mistral 7B Instruct.\footnote{Although we evaluate on a wide range of target coverages $1-\alpha$, conformal prediction is most meaningful at higher target coverages (e.g., $0.8$ or higher), since lower coverage guarantees can often be too weak to be useful in practice.}


We corroborate \citet{mohri2024language}'s findings that (standard) conformal prediction methods are able to achieve close to perfect coverage on biography generation. Specifically, we show that both SC and CQR achieve their target coverages (Appendix \ref{appx:analysis}, Figure \ref{fig:conformal_nq_additional}). Moreover, there is little difference between the two in terms of the average number of abstentions and the average number of facts retained per biography.
%
However, when evaluating coverage across individual subgroups, we find that both methods have some level of error. In Figure \ref{fig:conformal_nq}, we compare the mean absolute coverage error across all subgroups for each target coverage and find that SC and CQR exhibit high mean errors (of up to $0.1$ in some cases), despite achieving almost no error when evaluated (marginally) across the entire dataset (Figure \ref{fig:conformal_nq_additional}). 
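
To make the group-level metric concrete, one way to write the mean absolute coverage error is sketched below. The notation is illustrative and assumed rather than taken from Section~\ref{sec:methods}: $\mathcal{G}$ denotes the collection of subgroups, $S_g$ the (assumed nonempty) set of test biographies in subgroup $g$, $c_i \in \{0,1\}$ indicates whether the coverage event holds for the conformal output on biography $i$, and $\mathrm{MACE}$ is a label introduced here only for this sketch.
\[
\widehat{\mathrm{Cov}}_g = \frac{1}{|S_g|} \sum_{i \in S_g} c_i,
\qquad
\mathrm{MACE}(\alpha) = \frac{1}{|\mathcal{G}|} \sum_{g \in \mathcal{G}} \bigl|\widehat{\mathrm{Cov}}_g - (1 - \alpha)\bigr| .
\]
Taking $S_g$ to be the full test set gives the marginal coverage error. Because deviations of $\widehat{\mathrm{Cov}}_g$ from $1-\alpha$ in different subgroups can offset one another when averaged over the whole test set, a method can have near-zero marginal error while this group-level quantity remains large.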

Again, we investigate whether incorporating subgroup information can correct these biases.
Here, the message is clear: multivalid conformal methods reduce coverage error at the group level, regardless of the model, base scoring function, or algorithm type (Figure \ref{fig:conformal_nq}). We note, however, that we do not observe the same performance gains as for calibration (Table \ref{tab:multicalibration_asce_nq}), where group-conditional methods sometimes outperform marginal ones by an entire order of magnitude. This finding may be due in part to the smaller calibration set or to the possibility that (multivalid) conformal prediction for LLMs is a more challenging problem. We leave further investigation of this observation to future work.

 % Comparing the multivalid methods of both the iterative patching (MVSC) and linear regression (GCCQR) to their standard conformal techniques (SC and CQR), we do not observe the same performance gains as found for calibration (Table \ref{tab:multicalibration_asce_nq}) in which group-conditional sometimes outperform marginal methods by an entire order of magnitude, suggesting that (multivalid) conformal prediction for LLMs may be a more challenging problem.

% \footnote{We further expand upon this hypothesis in Appendix \ref{appx:analysis}.}
