
\section{Background and Problem Setup}

% We describe the problem setting, and provide a brief overview of conformal prediction, and its application to classification problems. We then discuss the challenges of applying conformal prediction to generative models with non-enumerable, unordered output spaces, and review existing methods that address this problem. \devjeet{TODO: }

%In this section, we first provide some background on conformal prediction (CP) and the challenges of applying CP for generative tasks. Next, we describe our problem setup and contrast it with the closest prior work for this setup. 
This section provides background and describes the formal problem setup.
We use capital letters $(X, Y)$ for random variables and lowercase letters $(x, y)$ for their realizations.

\subsection{Conformal Prediction Background}
\vspace{-1ex}

\input{gps-overview-figure/fig}
% \begin{figure*}
% \centering
% % \includegraphics[width=\textwidth]{figures/UAI-Figure.drawio.pdf}

% \caption{High-level illustration of calibration and inference steps in the GPS framework. During calibration, we first record sample counts $K_i$ needed for admissible solutions for each calibration input $X_i$ using the given deep generative model. Next, we give the augmented calibration examples $\{(X_i, K_i)\}$ to a conformal regression algorithm to create a calibrated estimator $\hat{f}$ of $\hat{K}(X)$ for a given input $X$. During inference, we employ the calibrated estimator $\hat{f}$ to predict $\hat{K}(X_{n+1})$ from a given test input $X_{n+1}$ and collect $\hat{K}(X_{n+1})$ samples from the given generative model as our prediction set.}
% \label{fig:gps-overview}
% \end{figure*}

%\vspace{1ex}

CP is a general %uncertainty quantification 
framework for constructing prediction sets with valid coverage guarantees \cite{Vovk2005-cp,lei2014distribution}. Split CP, the most widely used variant, works as follows. Given a calibration dataset $\{(X_i, Y_i)\}_{i=1}^n$ (e.g., image and class label pairs), a predictive model $\hat{f}: \mathcal{X} \rightarrow \mathcal{Y}$ (e.g., a neural network classifier), a significance level $\alpha \in (0, 1)$, and a new test example $(X_{n+1}, Y_{n+1})$ such that $\{(X_i, Y_i)\}_{i=1}^{n+1}$ are {\em exchangeable}, CP constructs a prediction set $\hat{C}(X_{n+1}) \subset \mathcal{Y}$ that contains $Y_{n+1}$ with probability at least $1-\alpha$.
A key component of CP is the \textit{non-conformity score}, $\mathcal{S}: \mathcal{X} \times \mathcal{Y} \rightarrow \mathbb{R}$, a heuristic measure of how poorly a candidate output $y$ conforms to the model's prediction for input $x$ (larger values indicate worse agreement).
%For example, a common choice for regression tasks is the residual, i.e. $\mathcal{S}(x, y) = |y - \hat{f}(x)|$. 
For classification tasks, the classifier's predicted distribution yields a natural score: $\mathcal{S}(x, y) = 1 - \hat{f}(y \mid x)$ for $y \in \{1, \dots, |\mathcal{Y}|\}$.
Let $S_i = \mathcal{S}(X_i, Y_i)$ denote the non-conformity score of the $i$-th calibration example, and let $\tau = Q_{1-\alpha}(\{S_i\}_{i=1}^n)$ denote the conformal $(1-\alpha)$-quantile (i.e., the $\ceil{(1-\alpha)(n+1)}/n$ empirical quantile) of the calibration scores.
% Let $Q_{\alpha}(\{x_i\}) = \inf_{j \leq n}\left\{\sum_{i=1}^j \mathds{1}\{ x_i \geq \alpha  \}/n \right\} $ denote the empirical quantile of the sample $Z_i$.
CP defines the prediction set as:
% \vspace{-1ex}
\begin{equation}
\label{eq:conformal-set-classification}
\hat{C}(X_{n+1}) = \{y \in \mathcal{Y}: \mathcal{S}(X_{n+1}, y) \leq \tau\}    
\end{equation}
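
\noindent As a concrete illustration with hypothetical numbers (not taken from our experiments), suppose $n = 19$ and $\alpha = 0.1$. The finite-sample correction gives
\[
\lceil (1-\alpha)(n+1) \rceil = \lceil 0.9 \times 20 \rceil = 18,
\]
so $\tau$ is the $18$-th smallest of the $19$ calibration scores, i.e., the empirical quantile at level $18/19 \approx 0.947$ rather than $0.9$. If, say, $\tau = 0.85$ and a three-class classifier outputs $\hat{f}(\cdot \mid x) = (0.7, 0.2, 0.1)$ on a test input $x$, the scores $1 - \hat{f}(y \mid x)$ are $(0.3, 0.8, 0.9)$ and Eq.~\ref{eq:conformal-set-classification} yields $\hat{C}(x) = \{1, 2\}$.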

%, where $\Tilde{\alpha}=\ceil{(1-\alpha)(n+1)}/n$ can be viewed as a finite sample correction. 
Let $P$ denote the joint distribution of all $n+1$ samples. CP then provides the following coverage guarantee~\cite{lei2014distribution}:
% \vspace{-1ex}
\begin{equation}
    \label{eq:conformal-guarantee}
    P\{Y_{n+1} \in \hat{C}(X_{n+1})\} \geq 1-\alpha
\end{equation}

If the scores are almost surely distinct (e.g., when they are continuously distributed),
the coverage is also bounded from above~\cite{romano2019conformalized}:
% \vspace{-1ex}
\begin{equation}
    \label{eq:conformal-guarantee-ub}
    P\{Y_{n+1} \in \hat{C}(X_{n+1})\} \leq 1-\alpha + \frac{1}{n+1}
\end{equation}
The guarantees above are \textit{distribution-free}, i.e., they hold for any data distribution $P$ and any $\alpha \in (0, 1)$. They are also \textit{marginal}: the probability statements are over the randomness of all $n+1$ examples (both the calibration data and the test example), and are not conditional on either the calibration data or the test input $X_{n+1}$.
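
\noindent To see how tight these bounds are, consider the following sketch. It uses only the standard fact that, under exchangeability and almost surely distinct scores, the rank of the test score $S_{n+1}$ among all $n+1$ scores is uniform on $\{1, \dots, n+1\}$. Assuming $\lceil (1-\alpha)(n+1) \rceil \leq n$, the test label is covered exactly when this rank is at most $\lceil (1-\alpha)(n+1) \rceil$, so
\[
P\{Y_{n+1} \in \hat{C}(X_{n+1})\} = \frac{\lceil (1-\alpha)(n+1) \rceil}{n+1} \in \left[1-\alpha,\; 1-\alpha + \frac{1}{n+1}\right).
\]
For example (hypothetical values), $n = 15$ and $\alpha = 0.1$ give $\lceil 14.4 \rceil / 16 = 15/16 = 0.9375$, which lies between $0.9$ and $0.9 + 1/16 = 0.9625$.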

% \vspace{-2ex}

\subsection{CP for Open-Ended Generative Tasks} 

% \vspace{-1ex}

The non-conformity score $\mathcal{S}(x, y)$ measures the distance between a base predictor's output $\hat{f}(x)$ and the true label $y$. For a calibration example $(x, y)$, this score represents the minimum ``radius'' around $\hat{f}(x)$ that encompasses the true label $y$. This interpretation is straightforward in regression for the typical score $\mathcal{S}(x, y) = |y - \hat{f}(x)|$. Computing the appropriate quantile $\tau$ of these distances on the calibration data allows us, at test time, to construct sets containing all $y$ within $\tau$-distance of $\hat{f}(x)$.
This approach relies critically on our ability to enumerate all possible outputs within a given radius of $\hat{f}(x)$. For tasks with bounded or ordered output spaces, such as multiple-choice question answering with LLMs, this enumeration is feasible and the CP procedure can be applied directly~\cite{kumar2023conformal}. However, open-ended generative tasks involve combinatorial output spaces that have neither property. In such cases, we can only access a restricted subset of the output space $\mathcal{Y}' \subset \mathcal{Y}$ (such as $M$ samples from a given generative model), which introduces two key challenges:

%\vspace{0.5ex}

\noindent{\bf Sample coverage:} We cannot guarantee that valid outputs $y \in \mathcal{Y}$ do not exist outside our restricted sample space $\mathcal{Y}'$. The vanilla CP procedure fails to account for sampling during calibration, as it computes scores $\{S_i\}$ without considering whether the true labels appear in the accessible subset of $\mathcal{Y}$. To address this challenge, we can modify our score to depend explicitly on the samples $(y_1', \dots, y_\tau')$ for some stopping rule $\tau$ (overloading notation, $\tau$ here denotes the number of samples drawn rather than the quantile above). One example is:
        %\vspace{-2ex}
    \begin{equation}
        \tilde{\mathcal{S}}(x, (y_1', \dots, y_\tau'), y) = \begin{cases}
            \mathcal{S}(x, y), & \text{if } y \in (y_1', \dots, y_\tau') \\
            \infty, & \text{otherwise}
        \end{cases}
    \end{equation}
    
    In practice, we typically require $\tau$ to be bounded. Importantly, for any generative model under finite sampling, there exists a minimum achievable error rate $\alpha^*$, because for some inputs the true label may never appear among the samples. For any target $\alpha < \alpha^*$, the only way to maintain valid coverage is to output the entire space $\mathcal{Y}$ on some inputs, effectively abstaining on that part of the data distribution.
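
    To make $\alpha^*$ concrete, consider a simplified sketch under assumptions that we introduce for illustration only: samples are drawn independently, each sample for input $x$ is correct with probability $p(x)$, and at most $M$ samples are drawn. Then no correct output appears among the $M$ samples with probability $(1 - p(x))^M$, so
    \[
        \alpha^* = \mathbb{E}_{X}\left[(1 - p(X))^M\right].
    \]
    For instance, if $p(x) = 0.5$ on $90\%$ of inputs and $p(x) = 0$ on the remaining $10\%$, then with $M = 10$ we get $\alpha^* = 0.9 \times 0.5^{10} + 0.1 \approx 0.101$, so a target of $\alpha = 0.05$ is unattainable without abstaining on some inputs.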

\noindent{\bf Semantic equivalence:} Generative tasks often admit multiple semantically equivalent ``correct'' solutions. For example, different valid sorting algorithms could solve the same programming task.
As a result, the ground-truth label $y$ may not appear in $\mathcal{Y}'$ even though some $y' \in \mathcal{Y}'$ is semantically equivalent to $y$.
% If we sample $(y_1, y_2, y_3, y_4)$ where $y_1$ and $y_4$ are semantically equivalent but syntactically different, and our calibration data contains $(x, y_4)$, we might incorrectly set the minimum number of samples to get admissible output $k(x, t) = 4$ instead of $1$. 
To handle this challenge, \cite{Quach2023-mq} introduce a binary admissibility function $\mathcal{A}(x, y)$ that evaluates correctness independently of specific reference labels; we adopt it in our work as well.
% This leads to a modified conformal coverage guarantee:
%     \vspace{-1ex}
%     \begin{equation}
%         P\{ \exists Y \in \hat{C}(X_{n+1}): \mathcal{A}(X_{n+1}, Y) = 1 \} \geq 1-\alpha
%     \end{equation}

% The non-conformity score $\mathcal{S}(x, y)$ captures the distance between the predictions of a base predictor $\hat{f}(x)$ and true labels $y$. Computing $\mathcal{S}(x, y)$, for an example $(x, y)$ in the calibration dataset can be seen intuitively is find the minimum "radius" around $\hat{f}(x)$ that will capture the ground truth label $y$. This intuition is clear when we consider a regression problem with $\mathcal{S}(x, y) = |y - \hat{f}(x)|$, and can be formalized as an alternative framing of CP (see~\cite{gupta2022nested}). If we compute the appropriate quantile of these distances, $\tau$, at test time, given a prediction $\hat{f}(x)$, we can produce a prediction set of all $y$ that lie within $\tau$-distance of $\hat{f}(x)$. But if we only have access to a restricted subset of the output space $\mathcal{Y}$ (in our case, the first $M$ samples from the generative model), we can't be sure if there exist other $y \in \mathcal{Y}$ that we didn't observe in our restricted subspace that have a non-zero probability of being correct. This is because we are not taking into account the sampling process during the calibration phase in the vanilla CP process; we compute scores $\{S_i\}$ based on $(X_i, Y_i)$ in the calibration data, without accounting for whether the true label actually appears in the restricted subspace of $\mathcal{Y}$ that we can access at test time. For some inputs, based on the quality of the underlying generative model, we might not be able to produce a set of samples such that the ground truth label appears in the set. To account for these, we can modify our score $\mathcal{S}$ to explicitly be a function of the samples $(y_1', \dots, y_M')$. In the simple case where we always sample a fixed number, $M$ samples from the generative model, this can be done as:
% \begin{equation}
%     \tilde{\mathcal{S}}(x, (y_1', \dots, y_M'), y) = \begin{cases}
%         \mathcal{S}(x, y), \text{if } y \in (y_1', ..., y_M') \\
%         \infty
%     \end{cases} 
% \end{equation}

% Another challenge of applying the vanilla CP procedure concerns a specific property of the output spaces of generative models; they often contain multiple semantically equivalent `correct' solutions. For example, a user might ask an LLM to produce a python program to sort a list of numbers. There are exist many different sorting algorithms which can reasonably considered valid solutions, but our calibration data contains only one of as the ground truth. Concretely, if we sample $(y_1, y_2, y_3, y_4)$, where $y_1$ and $y_4$ are semantically equivalent but distinct in exact token representations, and our calibration data contains $(x, y_4)$, then we might incorrectly set $k(x, t) = 4$ instead of $1$. At test time, this will lead to overly conservative sets. Thus, to account for this, we must relax our notion of what a correct solution is. \citet{Quach2023-mq} mitigate this issue by introducing a binary \textit{admissibility} function, $\mathcal{A}(x, y)$ that removes dependence on the specific label $y$ that appears in our calibration data. Many real world tasks such as code generation or math problem solving have a natural notion of admissibility that can be evaluated automatically, and independent of a reference label. Of course, one can still define admissibility as $\mathcal{A}(y, y') = 1$ if $y=y'$ and 0 otherwise if reference-less evaluations are not possible (e.g. in natural language tasks). 

% With this notion of admissibility, we can restate the conformal coverage guarantee in our setting as follows:
% \begin{equation}
% \label{eq:cp-with-admissibility-guarantee}
% P\{ \exists Y \in \hat{C}(X_{n+1}): \mathcal{A}(X_{n+1}, Y) = 1 \} \geq 1-\alpha
% \end{equation}


% The main challenge in the absence of an ordering is that we must enumerate $\mathcal{Y}$ to construct the set prescribed in Eq.~\ref{eq:conformal-set-classification}. This is intractable if $\mathcal{Y}$ is unbounded. For example, the output space of a transformer model \cite{Vaswani2017-sk} with vocabulary $V$ is the set of all strings based on $V$ (i.e., $\mathcal{Y} = \cup_{n=1}^\infty |V|^n$). 

% \vspace{-2ex}

\subsection{Problem Setup and Closest Work}

% \vspace{-1ex}


\paragraph{Problem setup.} We are now ready to formally describe our problem setting. We are given a conditional generative model $\hat{\pi}(\cdot \mid X)$ that maps each input in $\mathcal{X}$ to a distribution over outputs in $\mathcal{Y}$, and calibration data $\mathcal{D}_{X} = \{X_i\}_{i=1}^n$ drawn independently from an unknown distribution $P_X$. We assume that the output space $\mathcal{Y}$ is non-enumerable and unordered.
Let $\mathcal{A}: \mathcal{X} \times \mathcal{Y} \rightarrow \{0, 1\}$ denote an admissibility function that indicates whether a solution $Y \sim \hat{\pi}(\cdot \mid X)$ is admissible for input $X$.
For example, in a code generation task, $\mathcal{A}$ could be a function that checks if a generated program passes all test cases. 
Given a test input $X_{n+1}$, our goal is to generate a prediction set $\hat{C}(X_{n+1}) \subset \mathcal{Y}$ %for $Y_{n+1}$,
such that the set contains at least one admissible output $Y$ with high probability. Formally, for a specified significance level $\alpha \in (0, 1)$, we want to provide the following guarantee:
% \vspace{-1ex}
\begin{equation}
\label{eq:gaps-target-guarantee}
P\{ \exists Y \in \hat{C}(X_{n+1}): \mathcal{A}(X_{n+1}, Y) = 1 \} \geq 1-\alpha
\end{equation}
Our goal is to achieve this guarantee in a finite-sample, distribution-free setting, while minimizing the cost (e.g., number of samples or API usage cost) to generate $\hat{C}(X_{n+1})$. %Lastly, it is desirable for the prediction sets to be adaptive, i.e., the set size $|\hat{C}(X_{n+1})|$ varies across inputs depending on task difficulty: small sets for easy tasks and vice versa. 

\noindent {\bf Conformal language modeling.} CLM \cite{Quach2023-mq} is the closest prior work to ours. It constructs prediction sets with a different type of guarantee than vanilla CP. Specifically, given parameters $\alpha, \delta \in (0, 1)$ and test input $X_{n+1}$, CLM constructs a prediction set $\hat{C}(X_{n+1})$ that satisfies:
% \vspace{-2ex}
\begin{equation}
    \label{equation:clm-pac-style}
    \begin{split}
        P\big(P(\exists Y \in \hat{C}(X_{n+1}): \mathcal{A}(X_{n+1}, Y) = 1 \mid \mathcal{D}_{\text{cal}}) \\
        \geq 1-\alpha\big) \geq 1-\delta
    \end{split}
\end{equation}
Here, the inner probability is over draws of $X_{n+1}$ and the outer probability is over draws of the calibration dataset $\mathcal{D}_{\text{cal}}$. This nested probability structure makes direct comparisons between CLM and CP-based methods challenging, as we discuss in Section~\ref{app:equate-clm-gps}.
CLM consists of three key components: a set confidence estimator $\mathcal{F}$, a sample quality estimator $\mathcal{Q}$, and a sample similarity function $\mathcal{S}$ (not to be confused with the non-conformity score above), each parameterized by its own threshold. The algorithm iteratively builds a prediction set by drawing samples from the generative model and adding each one only if it meets both the quality and diversity thresholds (using $\mathcal{Q}$ and $\mathcal{S}$, respectively). Sampling continues until the set confidence threshold is met, as determined by $\mathcal{F}$. The thresholds $(\lambda_1, \lambda_2, \lambda_3)$ that control the risk are calibrated using the Learn-Then-Test framework \cite{angelopoulos2021learn}.
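
\noindent One simple way to relate the two types of guarantee, offered here only as a sketch and not necessarily the comparison used elsewhere in the paper, is the law of total expectation. Let $c(\mathcal{D}_{\text{cal}})$ denote the inner conditional probability in Eq.~\ref{equation:clm-pac-style} (a shorthand we introduce for this example). Since $c(\mathcal{D}_{\text{cal}}) \geq 1-\alpha$ with probability at least $1-\delta$, and $c(\mathcal{D}_{\text{cal}}) \geq 0$ otherwise,
\[
P\{ \exists Y \in \hat{C}(X_{n+1}): \mathcal{A}(X_{n+1}, Y) = 1 \} = \mathbb{E}\left[c(\mathcal{D}_{\text{cal}})\right] \geq (1-\alpha)(1-\delta).
\]
For example, $\alpha = 0.1$ and $\delta = 0.05$ imply marginal coverage of at least $0.9 \times 0.95 = 0.855$, which is weaker than the $0.9$ targeted by Eq.~\ref{eq:gaps-target-guarantee} at the same $\alpha$.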


% In this section, we discuss the closest prior work to ours, CLM, in more detail. 
% Given parameters $\alpha, \delta \in (0, 1)$, test input $X_{n+1}$, CLM constructs a prediction set, $\hat{C}(X_{n+1})$, with the following guarantee :
% % \begin{equation}
% %     P(P(\exists Y \in \hat{C}(X_{n+1}): \mathcal{A}(X_{n+1}, Y) = 1 | \mathcal{D}_{\text{cal}}) \geq 1-\alpha) \geq 1-\delta
% % \end{equation}


% The inner probability is over draws of $X_{n+1}$ whereas the outer probability is over draws of $\mathcal{D}_{\text{cal}}$. Notice that this guarantee is qualitatively different from the one offered by vanilla CP, which makes direct comparisons of LTT based methods to CP based methods tricky; we discuss this issue in Section\devjeet{TODO: X}. 

% As briefly alluded CLM consists of three components: a set confidence estimator $\mathcal{F}$, a sample quality estimator $\mathcal{Q}$, and a sample similarity function $\mathcal{S}$. Each of these components is parameterized by a threshold $\lambda$. The algorithm works as follows: 1) generate a sample from the generative model, 2) add sample to the set if it meets a) the quality threshold using $\mathcal{Q}$ and b) the diversity threshold using $\mathcal{S}$, and 3) repeat sampling until the set quality threshold is met, using $\mathcal{F}$. To determine thresholds for each component, $(\lambda_1, \lambda_2, \lambda_3)$ that control risk, CLM employs the Learn-Then-Test framework \cite{angelopoulos2021learn}.



% \paragraph{Preview of GPS} Our proposed framework, \methodname\, provides an sampling cost and abstention efficiency based trade-off between these two approaches. \methodname\ calibrates only a single stopping rule based only a single stopping rule based only on the inputs. We do this by simply noticing that if the samples $\{y_1, y_2, \ldots\} $ are collected independently, as is common with LLMs, $\mathcal{A}(x, y_i)$ follows a Bernoulli distribution, and thus the stopping rule follows a geometric distribution. We construct our nested sets as $\mathcal{F}_t(x) = \{y_1, \ldots, y_{\hat{K}(x)}, \ldots, y_{\hat{K}(x) + t}\}$ for $\hat{K}(x) + t \leq M$ and $\mathcal{Y}$ otherwise. We use the fact that the stopping rule is a geometric distribution as a function only of the input $x$ to learn a parametric estimator $\hat{K}(x)$. Since $\mathcal{F}_t$ is a set, de-duplication occurs naturally, and we show that this filtering is enough to produce small prediction sets comparable with CLM, while requiring a smaller number of samples like CLM, and provides lower abstention rates at tight $\alpha$-levels as well as lower computational complexity like vanilla CP methods. We can see in Figure~\ref{fig:motivating-example}, our base set generator \texttt{GPS L}, which only uses the input prompt's log probabilities, maintains similar set sizes and number of samples as CLM's best variant. But once we use a predictor on top of the input hidden states (\texttt{GPS HL}) it is able to provide us with a larger range of usable $\alpha$ with lower abstention rates while maintaining set sizes and number of samples.                      
% While prior work such as CLM are able to produce prediction sets with valid coverage guarantees, they either require a fixed number of samples to be collected from the model before sets are generated, or require \textit{online} sampling from the model. For example, CLM might produce an empty set if none of the candidate generations satisfy it's quality filters; but one must still pay the price of the generated samples. This makes it inefficient in terms of sampling cost. Moreover, imagine a scenario where we have a set of two models, one small and cost-effective (e.g. Phi-2) and the other expensive, but with high performance (e.g. GPT-4). One might wish to select which model to use based on a given input. To perform such efficiency oriented model selection (e.g. \cite{liang2024conformal}) while generating prediction sets, we need to know how many samples each model might require to produce the prediction set. With prior work, one cannot know the number of samples collected by a CP method apriori, which prevents their usage in such scenarios. Our proposed method, GPS, alleviates these limitations.
% However, open-ended generative modeling tasks present challenges, i.e. when $\mathcal{Y}$ is unbounded and unordered, we cannot enumerate $\mathcal{Y}$ to construct the set in~\ref{eq:conformal-set-classification}. 

% For examples, the output space for transformer model \cite{Vaswani2017-sk} with vocabulary $V$ is the set of all strings based on $V$ (i.e., $\mathcal{Y} = \cup_{n=1}^\infty |V|^n$). 

%. Examples of such tasks include multiple-choice question answering or time-series forecasting using large language models (LLMs). 
% However, many generative modeling tasks require combinatorial output spaces that have neither of these properties, and poses significant challenges for applying CP in this setting. The main challenge in the absence of an ordering is that we must enumerate $\mathcal{Y}$ to construct the set prescribed in Eq.~\ref{eq:conformal-set-classification}. This is intractable if $\mathcal{Y}$ is unbounded. For example, the output space of a transformer model \cite{Vaswani2017-sk} with vocabulary $V$ is the set of all strings based on $V$ (i.e., $\mathcal{Y} = \cup_{n=1}^\infty |V|^n$). %Thus, to apply CP in such settings, we must transform $\mathcal{Y}$ to either a bounded space, or to an unbounded but ordered space. 

% \subsection{Problem Setup and Closest Prior Work}

% \vspace{1ex}
