%\vspace{-2ex}
\section{Introduction}

Conditional deep generative models (DGMs) have recently become the dominant paradigm for a wide range of machine learning problems in domains including natural language processing, computer vision, and software engineering. Notable examples of DGMs include large language models (LLMs)~(\cite{brown2020language}), diffusion models~(\cite{yang2023diffusion}), and vision transformers~(\cite{khan2022transformers}). While these DGMs have displayed increasingly strong performance across a variety of generative tasks, they are far from infallible. For example, LLMs are known to produce hallucinations~(\cite{rawte2023survey}). Such erroneous outputs pose challenges for the use of DGMs in safety-critical domains including medicine, law, education, and finance. For example, in many modern IDEs, users can query an LLM to generate programs~(\cite{asare2023github}), and it would be desirable to attach a measure of uncertainty to the suggestions the LLM returns. This motivates the need for theoretically sound uncertainty quantification for DGMs so that they can be deployed safely.

% As a motivating example, consider code generation task, a common use-case for DGMs. In many modern IDEs, users are able to query an LLM to generate programs. When the user submits a query, one or more suggestions from the LLM are presented. However, LLM generated code can often be incorrect, insecure, and prone to bugs \cite{asare2023github}. Ad-hoc measures such as the model's log-likelihood can be uncorrelated with functional correctness due to surface form competition~\cite{holtzman2021surface,kuhn2023semantic}. Moreover, the user has no knowledge of whether \textit{any} of the programs presented are correct. Thus, it would be desirable to attach to this set of suggested outputs a measure of confidence, or uncertainty, that can be used by the user to make informed decisions. For example, based on this uncertainty measure, they could decide to either select a program from the suggestions, or simply discard suggestions if the uncertainty is too high. \devjeet{I think we can remove this para and }

% \begin{figure*}[htbp]
%    \centering
%    \begin{subfigure}[b]{0.33\textwidth}
%        \includegraphics[width=\textwidth]{figures/gps-motivating-figures/motivating-example-abstention-rate.pdf}
%        \caption{Abstention Rate vs $\alpha$}
%        \label{fig:motivating-example-abstention}
%    \end{subfigure}
%    \hfill
%    \begin{subfigure}[b]{0.33\textwidth}
%        \includegraphics[width=\textwidth]{figures/gps-motivating-figures/motivating-example-set-size.pdf}
%        \caption{Set Sizes vs $\alpha$}
%        \label{fig:motivating-example-setsize}
%    \end{subfigure}
%    \hfill
%    \begin{subfigure}[b]{0.33\textwidth}
%        \includegraphics[width=\textwidth]{figures/gps-motivating-figures/motivating-example-number-of-samples.pdf}
%        \caption{Number of Samples vs $\alpha$}
%        \label{fig:motivating-example-samples}
%    \end{subfigure}
%    \caption{Performance of CLM and GPS on the GSM8k dataset with GPT-4o-mini as the base LLM. The abstention rates of GPS drop sharply close to the model abstention rate, and this results in a wider range of usable $\alpha$ for GPS, while maintaining set sizes similar to CLM. }
%    \label{fig:motivating-example}
% \end{figure*}
\begin{figure*}
\centering
\includegraphics[width=\textwidth]{figures/gps-motivating-example.png}
\caption{Abstention rate, set size, and number of samples collected from GPT-4o-mini on the MATH benchmark for CLM and two variants of GPS: GPS L (prompt log probabilities as input features) and GPS HS (prompt hidden-state activations as input features). Since CLM provides PAC-style guarantees, we adjust the $\alpha$ level of GPS to match CLM's coverage guarantee. GPS HS produces valid prediction sets at tight $\alpha$ levels where CLM does not (green shaded area).}

\label{fig:gps-motivating-example}
% \vspace{-2ex}
\end{figure*}
% Natural language to code generation is a feature integrated in many modern software development tools~\cite{github_copilot}. When the user engages with code generation features, they are presented with one or more generated programs that meet their query. However, it is cognitively taxing for the user to verify each program to select the right one.  
Conformal prediction (CP)~(\cite{Vovk2005-cp,lei2014distribution}) is a general uncertainty quantification framework for producing prediction sets with theoretical guarantees for any black-box predictive model. Given calibration data and a user-specified significance level $\alpha$, CP uses a calibration step to generate sets that are guaranteed to contain the correct output with probability at least $1-\alpha$ (referred to as {\em coverage}). However, in the absence of an ordering over outputs, CP requires enumerating the entire output space, which makes it difficult to apply to DGMs with combinatorial output spaces.
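
In symbols, and in notation introduced here only for illustration, let $(X_i, Y_i)_{i=1}^{n}$ denote the calibration data, $(X_{n+1}, Y_{n+1})$ a test point assumed exchangeable with the calibration data, and $\hat{C}$ the calibrated set-valued predictor. The standard marginal coverage guarantee of CP can then be sketched as
\[
  P\bigl\{ Y_{n+1} \in \hat{C}(X_{n+1}) \bigr\} \;\ge\; 1-\alpha ,
\]
where the probability is taken jointly over the calibration data and the test point. Constructing $\hat{C}(X_{n+1})$ typically involves scoring each candidate output, which is the step that becomes intractable when the output space is combinatorial.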


There is little prior work on this challenging problem setting. Conformal Language Modeling (CLM)~(\cite{Quach2023-mq}) constructs prediction sets iteratively by examining each sampled output. It calibrates three parameters $(\lambda_1, \lambda_2, \lambda_3)$: $\lambda_1$ determines whether enough high-quality samples have been collected, while $\lambda_2$ and $\lambda_3$ filter the collected samples based on quality criteria (e.g., the total log probability of the samples). CLM uses Learn-Then-Test~(\cite{angelopoulos2021learn}) to find valid configurations of these parameters that provide coverage at the user-specified level $1-\alpha$ with high probability. However, it is not always possible to find valid configurations for tight (small) $\alpha$ levels. Any calibration algorithm operating with a finite sampling budget is fundamentally constrained by the base model's capacity to generate a correct output within that budget; we call the rate at which the base model fails to do so its {\em abstention rate}. To maintain distribution-free coverage when users request prediction sets at $\alpha$ levels below the base model's abstention rate, we must output the entire space $\mathcal{Y}$ for at least some inputs (referred to as {\em abstention}). We can see this phenomenon in Figure~\ref{fig:gps-motivating-example}.

On the MATH benchmark, GPT-4o-mini produces a correct output within the first 25 samples approximately 85\% of the time (marginally over the joint distribution), which corresponds to a base abstention rate of approximately 15\% (dashed red line in Figure~\ref{fig:gps-motivating-example}). In practice, CLM is even more constrained: it abstains on 100\% of our data until $\alpha = 0.2$ and does not reach a 0\% abstention rate until $\alpha=0.25$. This gap emerges because CLM's filtering parameters can reject correct solutions with non-zero probability, effectively inflating the model's abstention rate. Moreover, CLM's abstention is binary: for a given calibration dataset, CLM either produces a valid configuration, in which case it \textit{never} abstains, or fails to produce one, in which case it \textit{always} abstains. These limitations have two consequences: 1) more samples must be collected to produce valid prediction sets, and 2) if $\alpha$ is close enough to the DGM's abstention rate, the calibration algorithm must abstain with non-zero probability.
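
To make the arithmetic behind this constraint explicit, let $M = 25$ denote the sampling budget and let $p_M$ denote the probability that at least one of the first $M$ samples is correct; both symbols are introduced here only for illustration. A procedure that only returns subsets of these $M$ samples and never abstains has coverage at most $p_M$, so a target coverage of $1-\alpha$ is attainable without abstention only if
\[
  1-\alpha \;\le\; p_M \approx 0.85
  \quad\Longleftrightarrow\quad
  \alpha \;\ge\; 1 - p_M \approx 0.15 .
\]
The value $0.85$ is the approximate rate quoted above for GPT-4o-mini on MATH; by this reading, CLM first producing sets at $\alpha = 0.2$ leaves a gap of roughly $0.05$ above the smallest attainable level.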


In this work, we present a simple conformal calibration algorithm, referred to as {\em Generative Prediction Sets (GPS)}, to address these limitations. \methodname\ reduces the problem of constructing prediction sets for DGMs to a conformal regression problem: we learn an auxiliary predictor to estimate how many samples the model needs to produce a correct output, and employ CP tools~(\cite{romano2019conformalized}) to obtain valid prediction intervals around these estimates. This interval determines the number of samples we collect at test time to construct prediction sets in the original output space. A basic instantiation of \methodname\ (\texttt{GPS L} in Figure~\ref{fig:gps-motivating-example}) uses only the prompt's log probability as an input feature and calibrates just a single parameter using vanilla CP. It achieves abstention rates, set sizes, and sample efficiency similar to those of CLM without examining the sampled outputs, and is thus both simpler to implement and more computationally efficient. If we incorporate richer signals from the underlying DGM (e.g., hidden state activations for the input prompt), the resulting variant (\texttt{GPS HS} in Figure~\ref{fig:gps-motivating-example}) achieves substantially lower abstention rates and valid prediction sets at $\alpha$ levels closer to the base model's true abstention rate, while maintaining set sizes competitive with CLM at higher $\alpha$. In our empirical evaluations, we show that these benefits hold across a wide range of math, code, and natural language tasks and across a diverse set of base LLMs. As we show later, the lower abstention rates are due to \methodname's ability to \textit{selectively} abstain on specific inputs by prescribing a number of samples that exceeds our fixed sampling budget.
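
As a rough sketch of this reduction, in notation introduced here only for illustration, let $K(x)$ be the number of samples the DGM needs on input $x$ before it first produces a correct output, let $\hat{K}_\alpha(x)$ be the upper end of the calibrated prediction interval for $K(x)$, and let $M$ be the fixed sampling budget. Assuming the interval is calibrated so that $P\{K(X_{n+1}) \le \hat{K}_\alpha(X_{n+1})\} \ge 1-\alpha$ for a test input $X_{n+1}$, the prediction set can be written as
\[
  \hat{C}(x) \;=\;
  \left\{
  \begin{array}{ll}
    \{y_1, \dots, y_{\hat{K}_\alpha(x)}\} & \mbox{if } \hat{K}_\alpha(x) \le M, \\
    \mathcal{Y} & \mbox{otherwise (abstention)},
  \end{array}
  \right.
\]
where $y_1, y_2, \dots$ are samples drawn from the DGM given $x$. Under this assumption, and provided $\mathcal{Y}$ contains a correct output, $\hat{C}(X_{n+1})$ contains a correct output with probability at least $1-\alpha$; the exact scores and interval construction may differ from this simplified form.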

% The regression reduction, while simple, provides \methodname\ with a key benefit: it can \textit{choose} to abstain on specific examples based on the input, rather than either always or never abstaining for a given calibration data. 
% Additionally, \methodname\ has three qualitative benefits over CLM. First, \methodname\ reduces the problem of generating prediction sets for deep generative models to a vanilla CP problem. This reduction allows us to bring to bear the full body of machinery developed in vanilla CP to this problem setting. For example, the approximate conditional coverage method of \citet{Gibbs2023-ax} can be applied off-the-shelf to \methodname. Achieving such conditional coverage guarantees with CLM would require a non-trivial extension of the LTT framework.  Second, since \methodname\ calibrates only a single parameter, it has substantially lower computational complexity than CLM. Lastly, \methodname\ works in batch mode; it specifies how many samples are to be collected from the DGM apriori. \devjeet{TODO: Refer to schematic showing difference between CLM \& GPS procedures here} In contrast, CLM works sequentially, one sample at a time. Thus, \methodname\ can easily be used in modern batch inference pipelines, where it is desirable to produce all samples at once in a single batch to maximize hardware utilization, or in cost-sensitive applications, where users might want to make a trade-off between abstention rate and sampling cost. 

\noindent {\bf Contributions.} %\devjeet{TODO: This needs to be updated} 
The key contribution of this paper is the development and evaluation of the generative prediction sets framework.
\begin{itemize}
    \item We develop a CP method with provable coverage guarantees for producing valid prediction sets from a given deep generative model.
    \item \methodname\ is the first CP method for deep generative models that prescribes a stopping rule for sampling without requiring access to samples from the model at test time.
    \item We empirically evaluate \methodname\ on multiple benchmark datasets with diverse LLMs over text and code.
    
\end{itemize}

% However, the sequential nature of sampling in CLM creates challenges for efficient resource allocation: samples must be collected one at time, quality filters be evaluated, and only then we know if sampling must stop or continue. It is not amenable for modern batch inference pipelines that can more efficiently produce an entire batch of outputs rather than single outputs.
% CLM calibrates it's parameters using a framework called Learn-Then-Test~\cite{angelopoulos2021learn}, which is similar yet distinct from CP, and produces sets with rigorous coverage guarantees. Their results demonstrate that CLM fares well on traditional

% to calibrate a stopping rule to produce an initial set of samples, and additional rejection sampling parameters that select a subset of the initial sample set as a the final prediction set. The resulting sets satisfy rigorous coverage validity. However, in practice, producing prediction sets using CLM can be computationally intensive and resource inefficient. This is because CLM operates in an online manner; it collects a sample from the generative model, adds it to the prediction set if it meets quality criteria, and then repeats this process until the entire set attains a high enough confidence score based on the calibration parameters. As a consequence, one cannot simply collect all samples in an optimized batch inference setting to maximize hardware utilization. Moreover, in cost sensitive settings, one might want to make a decision on whether to \textit{abstain} from producing a prediction set for a given input based on the sampling cost to produce it. For example, a language model service provider, such as OpenAI or Anthropic, might want to limit the cost incurred by producing prediction sets based on the difficulty of the input; if the input is difficult and we know that it would take a large amount of samples to produce a valid prediction set, we might want to make a trade-off between cost and abstention rate. But this requires knowing apriori how many samples are to be collected from the model, before any sampling has taken place.





% However, most conditional generative models have {\em combinatorial output spaces} (e.g., software programs conditioned on a given textual prompt in code generation). Enumerating these output spaces is computationally intractable and sampling many outputs from the given generative model can be costly (e.g., API usage cost). Hence, standard CP methods for classification are not applicable because they assume that the output space is small and finite. This challenge is analogous to classification for simple outputs vs. structured output prediction (e.g., sequences, trees, and graphs) \cite{bakir2007predicting}.%, or infinite but ordered.






% This paper introduces a novel framework referred to as {\em Generative Prediction Sets (GPS)} to  produce valid and adaptive prediction sets from generative models. The {\em validity} of a prediction set is determined by a user-defined binary admissibility
% function depending on the application. For example, in code generation task, given a textual prompt as input, producing a set of programs such that at least one program in the set passes all test cases. {\em Adaptivity} means that the size of the prediction set varies across inputs. ~\methodname~ produces {\em exactly valid} prediction sets through a calibration method and requires only black-box access to generative model for sampling outputs.  Given a black-box generative model $\hat{\pi}_{Y|X}$, significance level $\alpha$, calibration inputs $\{X_i\}_{i=1}^n$, and a binary set admission function $\mathcal{A}(X, Y)$ for the calibration procedure,~\methodname\ produces a prediction set $\hat{C}(X_{n+1})$ with the following guarantee for a new testing input $X_{n+1}$:
% \begin{equation*}
%     P\{\exists Y \in \hat{C}(X_{n+1}): \mathcal{A}(X_{n+1}, Y) = 1\} \geq 1-\alpha
% \end{equation*}
%  \methodname\ achieves this guarantee by conformalizing the \textit{sampling process} used to generate outputs. %The intuition behind \methodname\ is as follows: given a calibration input $X_i$, we can sample outputs from the generative model until we reach an admissible solution. 
%  The key insight behind GPS is to exploit the inherent structure within the distribution over the minimum number of samples needed to obtain an admissible output and use it to formulate a simplified calibration approach over the minimum number of samples. Specifically, if $K_i$ is the number of samples required to reach an admissible solution for input $X_i$, $K_i$ follows a {\em geometric distribution}, with it's success probability an unknown function $f$ of $X_i$. Therefore, if we conformalize a point estimator for $K_i$ by estimating $f$, we can obtain conformal coverage guarantees around the minimum number of samples required to obtain an admissible solution.

 % GPS has two distinct advantages over prior work as elaborated in the related work section. 
 % First, the number of samples GPS requires varies across inputs and depends on problem difficulty, quality of the underlying generative model, and effectiveness of a user provided heuristic.
% For new test inputs, allows the user to view the coverage guarantees that can be achieved for different cost budgets (simplest notion is the number of samples) which varies depending on the hardness of test problem.
% In contrast, prior work \cite{Quach2023-mq}
%~\cite{su2024api,wang2024conu} 
% always requires generating a fixed number of samples regardless of the test input.
% Second, GPS does not require iterative sampling from the model; it predicts the number of samples required to obtain a valid prediction set without requiring access to any generated samples. The user can either choose to generate the samples with an admissible output, or {\em abstain}, if the cost for generating samples is too high. Therefore, GPS is well-suited for batch inference pipelines, where generating outputs iteratively can be inefficient.

% \methodname\ has a key advantage over prior work~\cite{Quach2023-mq}; it can produce valid prediction sets that do not require iterative sampling from the generative model. Specifically, it prescribes the number of samples that should be generated to produce a set with the desired coverage guarantee. This allows \methodname\ to be easily integrated into real world batch inference pipelines. Furthermore, the computational cost associated with generating the prediction set—quantified by the number of samples—is known a priori. This allows users to make informed decisions about resource allocation; if the cost is deemed too high, users can choose to \textit{abstain} without incurring any cost.





 
% We evaluate \methodname\ on several code, math and natural language tasks, using diverse LLMs. Our results show that \methodname\ is both marginally valid and efficient; it produces both smaller prediction sets, and requires a smaller number of samples to construct these sets than the closest prior work. Moreover, \methodname\ does this while having a much lower rate of abstention than prior work; it produces valid prediction sets at stricter coverage levels at which prior work abstains. %, all the while maintaining stricter coverage guarantees.
 %does not require generating a fixed number of samples for every test input, unlike prior work~\cite{su2024api,wang2024conu}, which always incurs the same cost regardless of the hardness of the test problem. Second, it does not require iterative sampling from the model; it predicts a cost (number of samples) required to obtain a prediction set without requiring access to any generated samples. Based on this predicted cost, the user can either generate these samples with the guarantee that an admissible output will be contained in this set, or {\em abstain}, if the predicted cost is too high. This apriori cost prediction also makes our method suitable for batch inference pipelines, where generating predictions iteratively can be inefficient.

%GPS explicitly takes into account a generalized notion of the \textit{cost} of generating samples from the generative model, and for new input examples, allows the user to view the coverage guarantees that can be achieved for different cost budgets.
 % GPS is a general framework that has many advantages. First, we could use different types of cost (e.g., number of samples, API usage cost, hardware utilization) to generate valid prediction sets. Second, the admission function setup is compatible with many human-ML collaborative systems (e.g., software engineering and medical practice). Third, the prediction sets from GPS are adaptive whose size varies depending on the hardness of input. Third, as a consequence of GPS's methodological formulation, it can provide semi-conditional guarantees. When compared to the only closely related work referred to as conformal language modeling (CLM) \cite{Quach2023-mq}, GPS provides stronger theoretical guarantees and produces adaptive prediction sets (CLM produces fixed size sets for all inputs).

% The generality of the cost heuristics enable the generation of sets with many different types of guarantees. For example, we could model the cost by the number of samples, wall time taken to generate samples, or api costs. The admission function setup is compatible the common scenario in many generative modelling tasks where for a given input, many different outputs can be admissible. For example, in code generation problems, there are many semantically equivalent programs that could satisfy the user's inputs, and we only care that the model generates one of these. ~\methodname\ can provide such guarantees at multiple levels of cost simultaneously, in a sample efficient manner. Moreover, under certain conditions,~\methodname~can also provide exact conditional guarantees. Even if these conditions are not met,~\methodname~is fully compatible with a wide range of existing conformal prediction techniques that provide semi-conditional guarantees.

% \vspace{-1ex}
