\input{tables/multicalibration_asce_nq}
\input{tables/multicalibration_brier_nq}

\section{Empirical evaluation}

We focus our empirical evaluation on biography generation, which we contend is a suitable testbed for evaluating factuality and which has served as a benchmark in several recent works. Biographies consist of objective, specific claims that can be checked individually, and they span a wide range of subjects, which in turn allows us to explore a rich set of group functions for each person. Moreover, bias in biography generation is a long-studied issue, further motivating the problem of ensuring group-conditional guarantees.
%
Like \citet{min2023factscore}, we use a language model to automate both the decomposition of biographies into claims and the evaluation of their factuality (Appendix~\ref{appx:generating}).
% Details can be found in Appendix \ref{appx:generating}.

\paragraph{Dataset}

We evaluate on a large set of biographies by extracting 8,541 entities from the Natural Questions dataset \citep{kwiatkowski2019natural}, which consists of real queries issued to the Google search engine. We denote this dataset as \textsc{Bio-NQ}. We choose Natural Questions because the human entities extracted from it should serve as a representative sample of the public figures that users may prompt an LLM about. For each question, we select all entities that appear in either the question's short answer or its accompanying Wikipedia article. We then attempt to match each entity to its corresponding Wikidata entry. If a match exists and the \textit{instance of} property of its Wikidata entry has the value \textit{human}, we add the entity to \textsc{Bio-NQ}.

\paragraph{Collecting group features}

To obtain groups for each person in our dataset, we scrape the Wikidata properties of each entity and identify those that are commonly shared among entities in \textsc{Bio-NQ}. The exact group attributes we use in our experiments are described in Appendix~\ref{appx:additional_exp_details}. To form groups $\mathcal{G}$ from these attributes, we take all $1$- and $2$-way combinations of attributes and the values they take on, giving us $|\mathcal{G}| = 77$ subgroups.
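
One way to write this construction down is sketched below. The notation is introduced here purely for illustration and is an assumption on our part: $a_1, \dots, a_k$ denote the attributes, $a_j(x)$ is the value of attribute $j$ for person $x$, and $V_j$ is the set of values that attribute $j$ takes on.
\[
\mathcal{G} \;=\; \bigl\{ \{x : a_j(x) = v\} \,:\, 1 \le j \le k,\; v \in V_j \bigr\} \;\cup\; \bigl\{ \{x : a_j(x) = v,\; a_l(x) = v'\} \,:\, 1 \le j < l \le k,\; v \in V_j,\; v' \in V_l \bigr\}.
\]
The first set contains the $1$-way groups (a single attribute fixed to a single value) and the second contains the $2$-way groups (intersections of two such conditions on distinct attributes). This sketch does not say how empty or very small intersections are handled.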

\paragraph{Generating confidence scores}

The algorithms described in Section~\ref{sec:methods} require a base scoring function. In our experiments, we use the following three:

\begin{enumerate}
    \item \textbf{Self-consistency} \citep{wang2022self}: Our first score is a frequency-based scoring function inspired by \textit{self-consistency}. To score each claim found in a generated biography, we prompt the LLM to generate $M$ additional biographies of the same person. We use the proportion of these reference generations that contain the claim as the uncertainty score. We automate the calculation of this score using BM25 and AlignScore \citep{zha2023alignscore} (see Appendix~\ref{appx:base_scoring_fns}).

    \item \textbf{P(True)} \citep{kadavath2022language}: For each biographical claim, we prompt the LLM to assess whether it is true or false. We then output the next-token probability of ``true'', normalized over the two tokens ``true'' and ``false'': $\frac{P(\mathrm{True})}{P(\mathrm{True}) + P(\mathrm{False})}$.

    \item \textbf{Verbalized confidence} \citep{tian2023just}: To obtain \textit{verbalized confidence} as an uncertainty estimate, one prompts the LLM to directly state its confidence in its response. We originally tried using the number the model generates as the score itself (e.g., an integer from 1 to 5, 1 to 10, or 1 to 100). However, we found these base scores to be somewhat unreliable. Instead, we ask the LLM to rate its confidence in each individual claim using an integer between 1 and 5, and we output a weighted sum of the next-token probabilities for the tokens ``1'' through ``5'': $\sum_{r=1}^5 r \times P(r)$.
\end{enumerate}
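
To make the three scores concrete, the following sketch writes each one as a formula and works a small example. The symbols $s_{\mathrm{SC}}$, $s_{\mathrm{PT}}$, $s_{\mathrm{VC}}$, and $y^{(m)}$ are introduced here only for illustration, and all numerical values are made up rather than taken from our experiments. For a claim $c$ and reference generations $y^{(1)}, \dots, y^{(M)}$,
\[
s_{\mathrm{SC}}(c) = \frac{1}{M} \sum_{m=1}^{M} \mathbf{1}\bigl[\, c \text{ is judged to be contained in } y^{(m)} \bigr],
\qquad
s_{\mathrm{PT}}(c) = \frac{P(\mathrm{True})}{P(\mathrm{True}) + P(\mathrm{False})},
\qquad
s_{\mathrm{VC}}(c) = \sum_{r=1}^{5} r \times P(r).
\]
For instance, if a claim is judged to be contained in $3$ of $M = 5$ reference generations, then $s_{\mathrm{SC}}(c) = 3/5 = 0.6$. If the next-token probabilities are $P(\mathrm{True}) = 0.6$ and $P(\mathrm{False}) = 0.2$, then $s_{\mathrm{PT}}(c) = 0.6 / 0.8 = 0.75$. If the probabilities of the tokens ``1'' through ``5'' are $(0.05, 0.05, 0.10, 0.30, 0.50)$, then
\[
s_{\mathrm{VC}}(c) = 1(0.05) + 2(0.05) + 3(0.10) + 4(0.30) + 5(0.50) = 4.15 .
\]
The last example assumes that the five token probabilities sum to one, in which case $s_{\mathrm{VC}}(c)$ lies in $[1, 5]$. Whether the probabilities are renormalized in this way, and whether the result is rescaled to $[0, 1]$, is not fixed by this sketch.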


