\newpage

\appendix

\onecolumn

\title{Multi-group Uncertainty Quantification for Long-form Text Generation\\(Supplementary Material)}
\maketitle

\section{Additional analysis and results}\label{appx:analysis}

We provide additional analysis of the methods studied in our paper. For conciseness, we conduct this analysis on methods applied to \textit{self-consistency} scores only. We note, however, that similar findings can be made when applying such methods to P(True) or verbalized confidence.

\paragraph{Calibration.}

\input{tables/asce_best_worst_groups}

In Section~\ref{sec:results}, we demonstrate that multicalibration methods (IGHB, GCULR) significantly outperform standard calibration methods (HB, PS) with respect to calibration error, both marginally and within groups (max and mean gASCE). In this section, we further examine group calibration error, specifically looking at the groups for which multicalibration methods improve most over marginal methods.

First, in Tables \ref{tab:asce_hb_best_worst_llama2} and \ref{tab:asce_hb_best_worst_mistral}, we compare HB to IGHB for outputs from \textsc{Llama 2 7B Chat} and \textsc{Mistral 7B Instruct}, calculating the difference $\Delta$ in ASCE between the two methods for each group. Interestingly, we find that IGHB improves over HB for every group. We note that this finding is expected when one has access to the true data distribution. In our case, we implement IGHB using the calibration set (since we do not have access to the true data distribution), suggesting that the distributions of our calibration and test sets are close enough that IGHB is still able to achieve such a strong result. Therefore, we present in Tables \ref{tab:asce_hb_best_worst_llama2} and \ref{tab:asce_hb_best_worst_mistral} the top 5 groups in terms of improvement $\Delta$ of IGHB compared to HB. For reference, we also present results for the group with the smallest improvement (to show the minimum improvement of the method).

Next, in Tables \ref{tab:asce_reg_best_worst_llama2} and \ref{tab:asce_reg_best_worst_mistral}, we compare PS to GCULR for outputs from \textsc{Llama 2 7B Chat} and \textsc{Mistral 7B Instruct}. Upon initial inspection, we find that, unlike IGHB relative to HB, GCULR does not improve ASCE for every single group when compared to its standard variant, PS. Thus, in Tables \ref{tab:asce_reg_best_worst_llama2} and \ref{tab:asce_reg_best_worst_mistral}, we instead show the top and bottom 5 groups in terms of improvement $\Delta$ of GCULR over PS. As shown in these results, GCULR, like IGHB, is able to improve ASCE by a large margin (top 5 $\Delta$). Moreover, we find that among the groups where GCULR is not able to improve ASCE (bottom 5 $\Delta$), the calibration errors of PS are already very small ($\le 0.0028$ for Llama 2 and $\le 0.0021$ for Mistral). While GCULR does worsen ASCE for these 5 groups, the mean difference $\Delta$ is only $0.0008$ and $0.0011$ for \textsc{Llama 2 7B Chat} and \textsc{Mistral 7B Instruct}, respectively, so the resulting calibration errors remain small. In comparison, when GCULR does reduce ASCE for a subgroup, it does so by a large margin, with a mean reduction in error of $0.0306$ and $0.0338$, respectively. Consequently, we still see large improvements for overall mean and max gASCE when comparing GCULR to PS (as shown in Table~\ref{tab:multicalibration_asce_nq} of the main body).
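
To make the comparison concrete, the per-group difference can be written as follows. This is only a sketch: the symbols $\Delta_g$ and $\mathrm{ASCE}_g$ are introduced here for illustration, and the sign convention (positive values mean that the multicalibrated method has lower error) is an assumption.
\[
\Delta_g = \mathrm{ASCE}_g(A_{\mathrm{base}}) - \mathrm{ASCE}_g(A_{\mathrm{multi}}), \qquad (A_{\mathrm{base}}, A_{\mathrm{multi}}) \in \{ (\mathrm{HB}, \mathrm{IGHB}),\ (\mathrm{PS}, \mathrm{GCULR}) \}, \qquad g \in \mathcal{G}.
\]
Under this convention, the mean values quoted above imply that the typical reduction in error achieved by GCULR exceeds the typical increase on the bottom 5 groups by a factor of roughly $0.0306 / 0.0008 \approx 38$ and $0.0338 / 0.0011 \approx 31$ for the two models, respectively.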

Finally, we note that in Tables \ref{tab:asce_hb_best_worst_llama2}, \ref{tab:asce_hb_best_worst_mistral}, \ref{tab:asce_reg_best_worst_llama2}, and \ref{tab:asce_reg_best_worst_mistral}, we observe the same set of groups in the top and bottom 5, sorted by difference in ASCE $\Delta$. For example, regardless of model (Llama vs. Mistral) or algorithm type (iterative patching vs. regression-based), the top 5 groups (and their order) are exactly the same. Similarly, we find that for both models, the group with the smallest improvement is \textbf{\# Wiki prop.} = \textit{Medium} \textcolor{ForestGreen}{\&} \textbf{nationality} = \textit{APAC}. These observations suggest that our findings are not unique to either the model choice or calibration algorithm type.

Looking specifically at the groups that multicalibration corrects the most (top 5 $\Delta$), we see that our models are most miscalibrated w.r.t. groups where the \# Wikidata properties is low, suggesting that standard calibration methods (HB, PS) are miscalibrated when it comes to quantifying uncertainty for individuals whose information is not prevalent on the Internet (and who therefore most likely appear less often in the data used to train LLMs today). Fortunately, however, incorporating group information (as is done in IGHB and GCULR) helps alleviate this issue (i.e., in Tables \ref{tab:asce_reg_best_worst_llama2} and \ref{tab:asce_reg_best_worst_mistral}, the mean ASCE of GCULR for the top and bottom groups is fairly close).

\paragraph{Conformal Prediction.}

\input{figures/conformal_nq_additional}

In Figure \ref{fig:conformal_nq_additional}, we provide additional information about the prediction sets output by our various conformal methods on \textsc{Bio-NQ}. On the \textbf{left} panel, we plot the empirical coverage achieved against the target coverage. Figure \ref{fig:conformal_nq_additional} demonstrates that all methods achieve the target (marginal) coverage. On the \textbf{middle} panel, we plot the fraction of biographies retained (i.e., non-abstentions) for each method against the target coverage level, while on the \textbf{right} panel, we plot the number of facts per biography retained. Generally speaking, all methods retain about the same number of facts per biography. We also observe that, to achieve the same target coverage, SC and MVSC generally retain fewer biographies (i.e., abstain more often) than CQR and GCCQR. However, when comparing each conformal method (SC and CQR) to its multivalid counterpart (MVSC and GCCQR), we again observe very little difference between them.

To help illustrate how different conformal methods (e.g., standard conformal vs. the multivalid counterpart) affect the final output text (i.e., subsets of retained claims), we provide examples\footnote{Note that these examples are meant to be illustrative; measuring the actual effectiveness of conformal prediction methods must be done at the group or dataset level (e.g., Figure \ref{fig:conformal_nq}).} output by models on \textsc{Bio-NQ}. In Tables \ref{tab:examples_llama2_retain_more} and \ref{tab:examples_mistral_retain_more}, we demonstrate how multivalid conformal methods can produce sets with additional claims retained. Moreover, in some cases, standard conformal methods (SC, CQR) may produce empty sets (abstain) while their multivalid counterparts do not (Tables \ref{tab:examples_llama2_nonempty} and \ref{tab:examples_mistral_nonempty}).

\input{tables/conformal_best_worst_groups}

Finally, as in the section above, we again examine the groups on which multivalid conformal methods improve most over standard methods. In this case, we instead calculate the difference $\Delta$ in coverage error (at a target coverage of $90\%$) between each pairing of conformal and multivalid conformal methods. In particular, we display the top and bottom 5 groups in terms of difference $\Delta$ in Tables \ref{tab:conformal_sc_best_worst_llama2}, \ref{tab:conformal_sc_best_worst_mistral}, \ref{tab:conformal_qreg_best_worst_llama2}, and \ref{tab:conformal_qreg_best_worst_mistral}.

Our findings show that for the topic of factuality in long-form text generation, multivalid conformal prediction is a more challenging problem when compared to calibration. As shown in Figure \ref{fig:conformal_nq}, multivalid methods (GCCQR and MVSC) consistently (at all target coverages) outperform standard conformal methods (SC and CQR) w.r.t. group coverage error. Tables \ref{tab:conformal_sc_best_worst_llama2}, \ref{tab:conformal_sc_best_worst_mistral}, \ref{tab:conformal_qreg_best_worst_llama2}, and \ref{tab:conformal_qreg_best_worst_mistral} corroborate this finding, showing that the mean coverage difference $\Delta$ for the top groups is larger than that for the bottom groups (at a minimum, $2.41\times$ more for MVSC and $3.95\times$ more for GCCQR). This demonstrates that multivalid methods tend to improve coverage error on some groups more than they worsen it on others, thereby achieving a better mean group coverage error overall. However, the improvements are not as stark as those found in Tables \ref{tab:asce_hb_best_worst_llama2}, \ref{tab:asce_hb_best_worst_mistral}, \ref{tab:asce_reg_best_worst_llama2}, and \ref{tab:asce_reg_best_worst_mistral} for calibration error, suggesting that multivalid conformal prediction may be a harder problem overall.

When looking at the groups on which multivalid conformal methods improve the most, we find no consistent patterns. However, we do observe that all groups on which MVSC or GCCQR improve the most are related to the number of Wikidata properties. Interestingly, we observe that CQR does quite poorly on groups containing people with a low number of Wikidata properties, mirroring our findings for calibration above. As in multicalibration, GCCQR is able to significantly improve coverage error for these groups. Lastly, we note that CQR seems to achieve worse group coverage than SC. For example, on outputs from \textsc{Mistral 7B Instruct}, the mean coverage error among the top 5 groups is $0.0732$ for CQR compared to $0.0363$ for SC. However, we find that for both models (Tables \ref{tab:conformal_qreg_best_worst_llama2} and \ref{tab:conformal_qreg_best_worst_mistral}), GCCQR is still able to reduce coverage errors to levels similar to those of MVSC (Tables \ref{tab:conformal_sc_best_worst_llama2} and \ref{tab:conformal_sc_best_worst_mistral}).
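
For reference, the per-group quantities discussed above can be written as follows. This is a sketch: the notation is introduced here for illustration, and measuring coverage error as the absolute deviation from the target coverage is an assumption rather than a definition taken from the main text.
\[
\mathrm{err}_g(A) = \bigl| \widehat{\mathrm{cov}}_g(A) - (1 - \alpha) \bigr|, \qquad \Delta_g = \mathrm{err}_g(A_{\mathrm{std}}) - \mathrm{err}_g(A_{\mathrm{mv}}), \qquad 1 - \alpha = 0.9,
\]
where $\widehat{\mathrm{cov}}_g(A)$ denotes the empirical coverage of method $A$ on group $g$, and $(A_{\mathrm{std}}, A_{\mathrm{mv}})$ is either $(\mathrm{SC}, \mathrm{MVSC})$ or $(\mathrm{CQR}, \mathrm{GCCQR})$. With the values quoted above for the Mistral model, the mean coverage error of CQR among the top 5 groups is about $0.0732 / 0.0363 \approx 2.0$ times that of SC.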

\input{tables/conformal_examples/retain_more}
\input{tables/conformal_examples/nonempty}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


\clearpage

\section{Additional experimental details}\label{appx:additional_exp_details}

%%%

\subsection{Biography generation and factuality evaluation}\label{appx:generating}

While the ground truth score must be human-annotated, \citet{min2023factscore} show that \textsc{FActSCORE} can be approximated by an automated process that leverages an LLM (i.e., ChatGPT and Llama 1 7B) and natural language retrieval. Following \citet{min2023factscore}, we also use an LLM to automate the annotation process. For a given input person, \textbf{\texttt{[ENTITY]}}, we prompt a large language model with the following:

\noindent \textbf{\texttt{[INST] Question: Tell me a bio of [ENTITY]. [/INST]}}

We then decompose each long-form generation into a set of atomic facts, which are then checked against a set of Wikipedia articles about the \textbf{\texttt{[ENTITY]}} to evaluate the overall performance of the language model in terms of factuality. \citet{min2023factscore} demonstrate that while the evaluation process should ideally be conducted by human annotators, using large language models (i.e., ChatGPT and Llama 1 7B) to both decompose long-form generations and check against Wikipedia articles serves as a very good proxy for human annotation.

Following this general framework for automated evaluation, we use Llama 2 7B Chat to decompose each generation \textbf{\texttt{[GEN\_BIO]}} with the following prompt:

\noindent \textbf{\texttt{[INST] <<SYS>> Break down the following input into a set of small, independent claims. You must not add additional information. Output the claims as a numbered list separated by a new line. The subject of each line should be [ENTITY]. <</SYS>> Input: [GEN\_BIO] [/INST] }}

For checking each atomic fact against Wikipedia, we directly use the code released by \citet{min2023factscore}, which first conducts passage retrieval via Generalizable T5-based Retrievers \citep{liu2023evaluating} to find relevant articles from a dump of Wikipedia (dated 2023-04-01) and then prompts an LLM (i.e., ChatGPT or Llama 1 7B) to predict whether each fact is supported by the retrieved passages. For our evaluation, we again use Llama 2 7B Chat. Finally, these predictions are ensembled with predictions using likelihood estimates derived from a nonparametric masked language model \citep{min2023nonparametric}.

We note that for prompting the LLMs described above, we use \href{http://huggingface.co}{Hugging Face}'s transformers library and generate responses with temperature set to $1.0$.

%%%

\subsection{Base scoring functions}\label{appx:base_scoring_fns}
    
\paragraph{Self-consistency.} Instead of manually annotating which claims are contained in the additional generations, we automate the process. Specifically, we use a procedure similar to the frequency scoring algorithm proposed by \citet{wang2024fine}, in which \textbf{(1)} a set of the $K$ most relevant claims from a reference generation is retrieved using a vanilla BM25 algorithm (to reduce computational costs), and then \textbf{(2)} an LLM is tasked to evaluate whether the target claim is supported by the set of $K$ reference claims. In our work, we replace LLM prompting in step \textbf{(2)} with \textsc{AlignScore-Large} \citep{zha2023alignscore}, which runs significantly faster and is reported by \citet{zha2023alignscore} to compare favorably to LLM-based alignment methods.
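
The two steps above can be summarized in symbols as follows. This is only a sketch: the notation ($c$ for the target claim, $y_1, \dots, y_M$ for the additional generations, $R_K$ for the retrieved claims, $a_m$ for the per-generation alignment score, and $s$ for the resulting score) is introduced here for illustration, and averaging the per-generation scores is our assumed reading of the frequency score.
\[
R_K(c, y_m) = \mathrm{BM25}_K(c, y_m), \qquad a_m(c) = \mathrm{AlignScore}\bigl( R_K(c, y_m),\, c \bigr), \qquad s(c) = \frac{1}{M} \sum_{m=1}^{M} a_m(c).
\]
Here $R_K(c, y_m)$ denotes the $K$ claims of generation $y_m$ ranked highest by BM25 for the query $c$, and $a_m(c) \in [0, 1]$ is the alignment score of $c$ against those claims.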

\paragraph{P(True).} We use the following prompt:

\noindent \textbf{\texttt{[INST] <<SYS>> Answer the question based on your knowledge of the topic, [TOPIC]. If you are unsure about the question, output False. <</SYS>> Question: Is the following statement True or False? [CLAIM] [/INST]}}

\paragraph{Verbalized confidence.} We use the following prompt:

\noindent \textbf{\texttt{[INST] <<SYS>> Given a [TOPIC]: [CLAIM] pair as input, use your knowledge about [TOPIC] to rate (on an integer scale between 1 and 5) how confident you are that the input [CLAIM] is true. <</SYS>> [TOPIC]: [CLAIM] [/INST]}}

%%%

\subsection{Miscellaneous details}

\paragraph{Datasets.} Table \ref{tab:dataset_stats} reports additional information about datasets \textsc{Bio-NQ} and \textsc{Bio-FActScore}, including the number of entities and claims per biography outputted by each model. 

\input{tables/dataset_stats}

\paragraph{Group Attributes.}
For our experiments, we use the following group attributes:
%
\begin{itemize}[itemsep=-2pt]
    \item \textbf{\# Wikidata properties}: For each entity, we count the number of Wikidata properties and discretize them into the following buckets: $[0, 25), [25, 50), [50, 100), [100, \infty)$. This group serves as a proxy for the amount of information available online for a given entity.
    
    \item \textbf{nationality}: Following \citet{min2023factscore}, who use nationality derived from Wikidata to sample their dataset of human entities, we take the property \textit{country of citizenship} (or \textit{place of birth} when not available) and categorize the corresponding value into the following categories defined by \citet{min2023factscore}: Asia/Pacific, Europe/Middle East, North America, Latin/South America/Africa.
    
    \item \textbf{sex or gender}: We take directly the value for the Wikidata property, \textit{sex or gender}.

    \item \textbf{plays professional sports}: We check whether the Wikidata entry has the property, \textit{sport}.

    \item \textbf{has IMDb entry}: We check whether the Wikidata entry has the property, \textit{IMDb ID}, to use as a proxy for whether a person has been involved in films or television series.
\end{itemize}

In total, we have $|\mathcal{G}| = 77$ subgroups. To prevent extremely uncommon groups that may exist in the Wikidata database from biasing our results, we exclude groups of size $<5\%$ of the total test set size. Note that while we create groups using $1$- and $2$-way combinations for evaluation, we train the quantile regression models in CQR and GCCQR using only single-attribute groups as features in order to reduce computation.

\paragraph{Hyperparameters.} For our patching algorithms IGHB and MVSC, we set the maximum number of iterations to $T=100$. For (multi)calibration, our logistic regression models are trained using the default hyperparameters given by \href{https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html}{scikit-learn}. For training CQR and GCCQR, we run 5-fold cross-validation for each target coverage $1-\alpha$ to optimize the $\ell_1$-penalty term $C \in \{ 10^{-6}, 10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1} \}$.

For \textsc{AlignScore}, we set $M=4$ and $K=5$. We found that \textsc{AlignScore} generally returns values close to $0$ or $1$, giving us self-consistency uncertainty scores around the $5$ values $\{ 0, \frac{1}{4}, \frac{1}{2}, \frac{3}{4}, 1 \}$. As a result, we evaluate all methods using $p=5$ level sets.
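
As a brief illustration of why five level sets arise, consider the following sketch. It assumes that the self-consistency score is the average of $M$ per-generation alignment scores, each rounded to $0$ or $1$; the symbols $s$ and $k$ are introduced here only for illustration. If $k$ of the $M = 4$ additional generations support a claim, then
\[
s = \frac{k}{M} \in \left\{ 0, \frac{1}{4}, \frac{1}{2}, \frac{3}{4}, 1 \right\}, \qquad k = 0, 1, \dots, 4,
\]
so the score takes $M + 1 = 5$ distinct values, matching the choice of $p = 5$.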

\paragraph{GPU requirements.} We use an NVIDIA A100 80GB GPU for all experiments. To obtain results on all entities across \textsc{Bio-NQ} and \textsc{Bio-FActScore}, our experiments require, per LLM, approximately the following:
%
\begin{itemize}[noitemsep]
    \item Generating biographies (+ 4 additional generations for getting frequency scores): 15 hours (x5)
    \item Splitting atomic facts (+ 4 additional generations for getting frequency scores): 30 hours (x5)
    \item Checking facts against Wikipedia: 75 hours (x1)
    \item Calculating frequency scores via AlignScore: 10 hours (x1)
\end{itemize}
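
Summing the itemized estimates gives a rough total per LLM. This is a sketch that assumes the multiplier in parentheses is the number of times each step is run (once for the original generation and once for each of the $4$ additional generations):
\[
15 \times 5 + 30 \times 5 + 75 \times 1 + 10 \times 1 = 75 + 150 + 75 + 10 = 310 \ \mbox{GPU hours}.
\]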

\paragraph{Licenses.} Wikidata is licensed under the Creative Commons CC0 License, while Wikipedia text is licensed under the Creative Commons Attribution-ShareAlike (CC BY-SA) license. Llama 2 7B is licensed under Meta's Llama 2 license. Mistral 7B Instruct and Hugging Face's transformers library are licensed under the Apache 2.0 license. We also make use of \href{https://github.com/shmsw25/FActScore}{code} released by \citet{min2023factscore} under the MIT license.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


\clearpage

\section{Evaluating on entities used in Min et al. (2023a)}

In addition to evaluating on our dataset, \textsc{Bio-NQ}, we construct an additional dataset using the $683$ entities used by \citet{min2023factscore} for their empirical evaluation. We denote this dataset as \textsc{Bio-FActScore} and evaluate all methods using \textit{self-consistency} as the base scoring function.

\paragraph{Calibration.} In Table \ref{tab:multicalibration_asce_fs}, we observe results similar to those on \textsc{Bio-NQ}: namely, the multicalibrated methods (IGHB and GCULR) perform better than their base counterparts (HB and PS). However, we note that for Mistral 7B Instruct, PS performs the best when looking at marginal ASCE. We hypothesize that the smaller gap in ASCE between PS and GCULR may be due to the smaller training size of \textsc{Bio-FActScore} (25,283 claims), which is roughly 10x smaller than that of \textsc{Bio-NQ} (297,714 claims). Lastly, with respect to Brier score, multicalibration still dominates across all metrics (Table \ref{tab:multicalibration_brier_fs}).

\input{tables/multicalibration_asce_fs}
\input{tables/multicalibration_brier_fs}

\paragraph{Conformal Prediction.} For \textsc{Bio-FActScore}, we observe that multivalid conformal methods \textit{do not} improve performance across subgroups. In Figure \ref{fig:conformal_fs}, we observe very little difference in mean coverage error across groups. We hypothesize, however, that this negative result is again due to the smaller dataset size. In this case, our number of examples is the number of biographies in the dataset (683), giving us a calibration set size of 546 and a test set size of 137. After further dividing the calibration and test sets into subgroups, there may simply not be enough examples per group for the distribution on the calibration set to generalize to the test set. Comparing the left panels of Figures \ref{fig:conformal_fs_additional} and \ref{fig:conformal_nq_additional}, we also find that even when looking at marginal coverage, all methods perform worse (the lines deviate more from $y=x$), likely due again to the small calibration and test sizes.
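
As a quick check of the sizes involved, consider the following sketch. It assumes an $80/20$ calibration/test split and that group sizes are counted in biographies, neither of which is stated explicitly above.
\[
0.8 \times 683 = 546.4 \ \Rightarrow\ 546 \ \mbox{(calibration)}, \qquad 683 - 546 = 137 \ \mbox{(test)}, \qquad 0.05 \times 137 \approx 7.
\]
Under these assumptions, a group at the $5\%$ inclusion threshold described in Appendix~\ref{appx:additional_exp_details} would contain only about $7$ test biographies, which is consistent with the hypothesis that per-group sample sizes are too small here.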

\input{figures/conformal_fs}
\input{figures/conformal_fs_additional}
