\section{Conclusion}

In this paper, we conduct an extensive study on uncertainty quantification for long-form text generation. We focus on two forms of uncertainty---claim-level (calibration) and biography-level (conformal prediction)---and present a variety of methods for these settings. 
%
We empirically validate that marginal methods for calibration and conformal prediction perform well when evaluated across the entire dataset. However, when we examine individual subgroups, we find that performance consistently degrades.
%
Introducing two categories of algorithms (iterative patching and linear regression), we demonstrate that by accounting for additional groups, multicalibration and multivalid conformal prediction methods correct the aforementioned biases of their marginal-guarantee counterparts. We consider these empirical results to establish a benchmark for this setting and hope that our findings will motivate future work in this area.

\iffalse

\section{Limitations}

Past works on uncertainty quantification for NLP have made great strides in providing confidence scores for text generation, and our work furthers this line of research. However, we note that there exist certain limitations in our empirical findings. First, while \citet{min2023factscore} demonstrate the effectiveness of using an LLM to both decompose generations into atomic facts and assess the factuality of these claims, this automated process serves only as a proxy for the gold standard of human annotation. Moreover, as in \citet{min2023factscore}, our assessment of factuality is limited to whether claims are supported by Wikipedia, which again serves only as a proxy for factuality.

% In addition, while we found that using frequency/self-consistency scores as our base scoring function works well for both calibration and conformal prediction for generating biographies, it is ineffective for distinguishing between true and false claims when frequency scores are low (Figure X). For example, when using these scores as the basis for conformal prediction, many claims (or entire generated biographies) end up being filtered when one wishes to achieve high target coverages greater (such as those presented in the results found in Section \ref{sec:results}. Our experiments demonstrate then that there still exists a strong need for scoring functions (or language models in general) that are better correlated with factuality.

% Due to computational constraints, we evaluated on smaller versions of  did not evaluate on larger models, especially those with more parameters. Also, i
In our experiments, we chose groups based on attributes that are general (and obtainable) for any public figure (i.e., Wikidata properties). This set of groups is non-exhaustive, and other sets of groups may be relevant for various problem domains. While we do not expect our general findings to change when additional subgroups are added, further investigation is needed to confirm this hypothesis.

% In addition, while we follow prior work and focus on the problem of generating biographies, our uncertainty techniques extend broadly to any long-form generation task where information contained in the output can be broken down into independent claims. Furthermore, the automated evaluation pipeline proposed by \citet{min2023factscore} is still valid, as long as the entities one prompts the LLM for a description of can be found on Wikipedia (i.e., animals, places, historical events, etc.). Consequently, in future work, we hope to validate our findings for other long-form generation tasks.

Finally, our experiments are limited to English text generation. In future work, we plan to extend our study to the multilingual setting to address this limitation.

\section{Potential Risks}

Regarding potential risks, as mentioned previously, large language models tend to hallucinate. Moreover, deep learning models in general may exhibit or perpetuate bias. By better quantifying the uncertainty of LLMs, our work aims to help alleviate the risk of models producing false information. Moreover, our emphasis on multi-group guarantees aligns with certain definitions of fairness. However, our work may not entirely mitigate such issues, and other malicious or unintended harmful effects could still persist in the outputs of the calibration and conformal methods we evaluate.

\fi