
% \begin{table*}[!htb]
% \centering
% \caption{Results comparing GPS and CLM on different datasets and generative models across 100 trials. \textbf{Bold} indicates best performance within each subgroup. Refer to Appendix~\ref{app:experiment} for additional values of $\alpha$ and standard deviations (omitted for space).}
% \fontsize{9pt}{11pt}\selectfont
% \setlength{\tabcolsep}{1mm}
% % \resizebox{\textwidth}{!}{
% % \small % or \footnotesize if needed

% %% ORIGINAL TABLE
% % \input{tables/clm-vs-gps-multi-alpha}
% %% UPDATED TABLE WITH ADJUSTED ALPHA VALUES
% \input{tables/updated-gps-vs-clm-with-adjusted-alphas}

% % }
% \label{tab:clm-vs-gps-multi-alpha}
% \end{table*}

\vspace{-2ex}

\section{Experimental Results}
\label{sec:experimental-results}

In this section, we describe the experimental evaluation of our proposed GPS framework, compare it with state-of-the-art methods, and discuss the results along different dimensions.

\vspace{-2ex}
\subsection{Experimental Setup}

%\vspace{1ex}

\begin{figure*}[!h]
     \centering
        \includegraphics[width=\textwidth]{figures/gps-all-results-main.png}
     \caption{Results comparing GPS and CLM on different datasets (one row each for GSM8K, MATH, MBPP, and TriviaQA) using the GPT-4o mini LLM, across 100 trials, with shaded regions indicating standard deviation. Both \texttt{GPS} variants achieve lower abstention rates and higher non-abstention coverage, while maintaining set sizes competitive with CLM.} % Appendix~\ref{app:experiment} has results for additional models/datasets.}
     \label{fig:main-results}
\end{figure*}

\noindent {\bf Benchmark datasets.}
We employ data from three diverse tasks. %a diverse set of tasks, including code generation, natural language understanding, and math problem-solving for our experimental evaluation. 
{\em 1) Code generation tasks:} We employ the MBPP~\cite{Austin2021-qs} and DS-1000~\cite{lai2023ds} datasets to evaluate performance on code generation tasks, using execution-based functional correctness as our binary admission function. {\em 2) Math problem-solving tasks:} We employ the GSM8K~\cite{Cobbe2021-lg} and MATH~\cite{hendrycks2021measuring} datasets. {\em 3) Natural language understanding tasks:} We employ a natural language question-answering dataset, TriviaQA~\cite{joshi2017triviaqa}. For both math problems and natural language understanding, we use exact match as the admission function. Since our CLM baseline was unable to produce any valid configuration on DS-1000 for any $\alpha \in [0.1, 0.5]$, we defer the DS-1000 results to the Appendix. %We generally follow a 50/25/25 split for training, calibration and testing respectively, but for smaller datasets like MBPP we reduce the size of the training split to ensure enough examples are available for calibration and testing.

% \vspace{1ex}

\noindent {\bf Deep generative models.} We consider three models across various parameter scales: Phi-2~\cite{li2023textbooks}, Llama 3 8B, and GPT-4o mini. For MATH and DS-1000, we also add Gemma 2 27B~\cite{team2024gemma}, since Phi-2 and Llama 3 8B had low success rates on both of these tasks. Across all models, we use nucleus sampling~\cite{holtzman2019curious} with a sampling budget of 25.
For space reasons, we only present results on GPT-4o mini in the main text, while the results for the other models are provided in the Appendix.

% % We provide details about the generation configuration, including prompts, in the Appendix.

% \vspace{1ex}

\noindent {\bf Configuration of GPS.} We consider two predictors for the admissibility estimator $\hat{f}$. First, as a baseline, \texttt{GPS-L} uses a linear regressor that directly predicts the probability of success from the log probability of the input prompt. In contrast, \texttt{GPS-HS} uses a feed-forward neural network that takes a latent-space representation of the prompt as input to predict the probability of success. Since the hidden states of GPT-4o mini are not accessible, we use hidden state activations from Phi-2 as a surrogate. In the Appendix, we provide results for models with accessible hidden states.
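Schematically, and with notation that is introduced here purely for illustration ($p_\theta(x)$ for the probability the model assigns to the prompt $x$, $h(x)$ for the hidden-state representation of $x$, $w$ and $b$ for learned scalars, and $g_\phi$ for a feed-forward network), the two predictors can be thought of as having the form
\[
  \hat{f}_{\mathrm{L}}(x) = w \log p_\theta(x) + b,
  \qquad
  \hat{f}_{\mathrm{HS}}(x) = g_\phi\bigl(h(x)\bigr).
\]
This is only a sketch of the two input choices; details such as how the outputs are constrained to $[0,1]$ are not fixed by the description above and are left unspecified here.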

%\vspace{1ex}


\noindent {\bf CLM baseline.} We use CLM as our baseline, adapting its public code\footnote{https://github.com/Varal7/conformal-language-modeling}, with normalized log probabilities as the quality score ($\mathcal{Q}$), ROUGE-L as the similarity score ($\mathcal{S}$), and $\delta = 0.05$. For the set quality score $\mathcal{F}$, we consider the four variants reported in the original paper by \citet{Quach2023-mq}. Three of them differ only in their choice of $\mathcal{F}$: \texttt{CLM\ First-K} uses the size of the prediction set, while \texttt{CLM\ Sum} and \texttt{CLM\ Max} use the sum and the maximum, respectively, of the quality scores of the samples in the set. \texttt{CLM\ First-K} (nr) is the same as \texttt{CLM\ First-K}, but without any rejection rule (no rejection).
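Writing $\hat{C}$ for a candidate prediction set (a symbol assumed here for illustration), the three choices of set score described above can be summarized as
\[
  \mathcal{F}_{\mathrm{First\mbox{-}K}}(\hat{C}) = |\hat{C}|,
  \qquad
  \mathcal{F}_{\mathrm{Sum}}(\hat{C}) = \sum_{y \in \hat{C}} \mathcal{Q}(y),
  \qquad
  \mathcal{F}_{\mathrm{Max}}(\hat{C}) = \max_{y \in \hat{C}} \mathcal{Q}(y).
\]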
% \begin{itemize}
%     \item First-K: $\mathcal{F}= |\hat{C}|$ 
%     \item First-K (nr): same as First-K but without rejection
%     \item Sum: $\mathcal{F} = \sum_{y \in \hat{C}} \mathcal{Q}(y)$
%     \item Max: $\mathcal{F} = \max_{y \in \hat{C}} \{\mathcal{Q}(y)\}$
% \end{itemize}

% The suffix (nr) denotes a variant of CLM for which sample rejection is disabled. 
%Note that while there exist other CP frameworks for LLMs, we exclude them from our evaluation since their implementations are not publicly available, to the best of our knowledge. 
% Moreover, at a conceptual level, unlike CLM and \methodname\ they do not calibrate a stopping rule for sampling; instead, they perform rejection of samples based on a fixed number of samples collected at test time. 

%\vspace{1ex}

\noindent {\bf Evaluation methodology.} We evaluate \methodname\ and CLM over 100 trials for $\alpha$ ranging from 0.1 to 0.5 in increments of 0.05. This covers the spectrum of $\alpha$ values that are useful in practice. CLM's guarantees are qualitatively different from those of vanilla CP, so for a fixed $\alpha$, the sets generated by CLM and \texttt{GPS} cannot be directly compared. However, the calibration-conditional coverage of conformal predictors follows a Beta distribution that depends only on $\alpha$ and $n$; we use this fact to find $\alpha_\delta$, the level at which CP achieves the same guarantee as CLM. Thus, we follow \citet{Quach2023-mq} in setting $\delta=0.05$, and adjust $\alpha$ for \texttt{GPS} accordingly to ensure a fair comparison. %We discuss this in detail in the Appendix~\ref{app:equate-clm-gps}. 
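As a rough sketch of this adjustment (the symbols $l$, $l_\delta$, and $F$ are introduced here for illustration, and we assume $n$ exchangeable calibration examples with almost surely distinct conformity scores), recall that a split conformal predictor calibrated at level $\alpha'$ has coverage, conditional on the calibration set, distributed as
\[
  \mathrm{Beta}(n + 1 - l,\; l), \qquad l = \lfloor (n+1)\alpha' \rfloor \geq 1 .
\]
One way to match a guarantee of the form ``coverage at least $1-\alpha$ with probability at least $1-\delta$ over the calibration data'' is then to pick the largest admissible index
\[
  l_\delta = \max\bigl\{\, l \in \{1, \dots, \lfloor (n+1)\alpha \rfloor\} \;:\; F_{\mathrm{Beta}(n+1-l,\, l)}(1-\alpha) \leq \delta \,\bigr\},
  \qquad
  \alpha_\delta = \frac{l_\delta}{n+1},
\]
where $F_{\mathrm{Beta}(a,b)}$ denotes the cumulative distribution function of the $\mathrm{Beta}(a,b)$ distribution. If no such $l$ exists, no level satisfies the requirement for the given $n$. By construction $\alpha_\delta \leq \alpha$, so the adjusted predictor is run at a stricter level than the nominal one.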

\noindent {\bf Metrics.} We report four metrics on the test samples: {\bf 1)} average prediction set size, {\bf 2)} average number of samples generated, {\bf 3)} abstention rate, and {\bf 4)} non-abstention empirical coverage. As mentioned previously, the abstention rate allows us to determine the range of $\alpha$ over which a method can produce valid prediction sets. However, for methods such as \methodname\ that can abstain selectively on certain inputs, the abstention rate alone is not sufficient: for a given $\alpha$, a method might simultaneously have both a low abstention rate and low coverage when it does not abstain, i.e., lowering the abstention rate yields no gain in coverage. Thus, we also present the non-abstention coverage rate: the fraction of data on which a method a) does not abstain, i.e., outputs a finite-sized set, and b) outputs a set that contains an admissible solution. Since both CLM and GPS achieve valid empirical coverage, we %omit this from the main paper, and 
show those results in the Appendix. %instead.
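
In symbols, and with notation assumed here only for illustration, let $x_1, \dots, x_m$ be the test inputs, $\hat{C}(x_i)$ the prediction set returned for $x_i$, $a_i \in \{0,1\}$ an indicator that the method abstains on $x_i$, and $A(x_i, y) \in \{0,1\}$ the admission function. The last two metrics can then be written as
\[
  \mathrm{AbstentionRate} = \frac{1}{m} \sum_{i=1}^{m} a_i,
  \qquad
  \mathrm{NonAbstentionCoverage} = \frac{1}{m} \sum_{i=1}^{m} (1 - a_i)\, \mathbf{1}\bigl[\exists\, y \in \hat{C}(x_i) : A(x_i, y) = 1\bigr].
\]
Under this reading, non-abstention coverage is normalized by all $m$ test inputs rather than only the non-abstained ones, so it can never exceed one minus the abstention rate.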

% Empirical miscoverage indicates the validity of the prediction sets at the target coverage level $\alpha$; we expect this to be less than or equal to $\alpha$. APSS measures the size of the sets, while NS represents the total number of samples that needed to be generated from the model to produce the prediction set. An efficient calibration algorithm will produce small sets, while minimizing the total number of samples generated. 


% Please add evaluation procedure: alpha values employed and why; testing data; evaluation metrics (description and motivation, and lower/higher is considered better). Do we repeat experiments multiple times?


\begin{figure*}[!h]
     \centering
     \begin{subfigure}[b]{\columnwidth}
         \centering
         \includegraphics[width=0.9\columnwidth]{figures/gsm8k-main-test-apss.pdf}
         \caption{GSM Benchmark}
         \label{fig:scaling-w-pass-rate-gsm}
     \end{subfigure}
     % \vfill
     \hfill
     \begin{subfigure}[b]{\columnwidth}
         \centering
         \includegraphics[width=0.9\columnwidth]{figures/mbpp-all.pdf}
         \caption{MBPP Benchmark}
         \label{fig:scaling-w-pass-rate-mbpp}
     \end{subfigure}
     \caption{Results for \texttt{GPS-HS}: average prediction set size (APSS) vs. coverage level $\alpha$ for (a) GSM8K and (b) MBPP benchmark datasets, using models of varying quality (Phi-2, Llama 3 8B, and GPT-4o mini).}
     \vspace{-3mm}
\end{figure*}

%\vspace{-3ex}
\subsection{Results and Discussion}

% \begin{figure*}[!h]
%      \centering
%      \begin{subfigure}[b]{\columnwidth}
%          \centering
%          \includegraphics[width=\columnwidth]{figures/cost-vs-effective-coverage/cost-vs-effective-coverage-gpt-4o-mini_gsm8k-main-test_20240731052709.pdf}
%          \caption{GSM}
%          \label{fig:cost-vs-ecr-gsm}
%      \end{subfigure}
%      % \vfill
%      \hfill
%      \begin{subfigure}[b]{\columnwidth}
%          \centering
%          \includegraphics[width=\columnwidth]{figures/cost-vs-effective-coverage/cost-vs-effective-coverage-gpt-4o-mini_math-test_20240730181855.pdf}
%          \caption{MATH}
%          \label{fig:cost-vs-ecr-math}
%      \end{subfigure}
%      \caption{Results for \texttt{CLM} and \texttt{GPS}: total number of samples vs effective coverage at different($\alpha$ levels.}
%      \vspace{-3mm}
% \end{figure*}

%Figure~\ref{fig:main-results} shows the results for \methodname\ and CLM.        


%\vspace{1ex}

\noindent {\bf Overall discussion.} To summarize, the empirical results in Fig.~\ref{fig:main-results} show that \texttt{GPS} can produce set sizes that are competitive with CLM, but over a wider range of $\alpha$ and with higher non-abstention coverage. We stress that \texttt{GPS} achieves such performance a) \textit{without} ever examining the generated samples, b) using a simpler calibration algorithm that is easy to implement in practice, and c) in $\alpha$-regimes where CLM simply does not produce valid configurations. We highlight our key observations from Fig.~\ref{fig:main-results} below:
{\bf 1.} \texttt{GPS-L} and \texttt{GPS-HS} achieve higher non-abstention coverage than the CLM variants, particularly for $\alpha \leq 0.3$. \texttt{GPS-HS} shows superior non-abstention coverage across all $\alpha$ on every dataset except MBPP, likely because its small training set (150 examples) yields a poor-quality $\hat{f}$. On GSM8K (659 examples), \texttt{GPS-HS} consistently outperforms all methods across all $\alpha$ in both abstention rate and non-abstention coverage. \\
{\bf 2.} For GSM8K, MATH, and TriviaQA, \texttt{GPS-HS} maintains an abstention rate $< 1$ and non-abstention coverage $\geq 0.45$ even at $\alpha \approx 0.1$. This demonstrates the effectiveness of $\hat{f}$ at selectively abstaining on difficult inputs, yielding valid configurations with non-trivial coverage. The impact of predictor quality is clear: \texttt{GPS-L} uses the same calibration procedure but with a weaker $\hat{f}$. \\
{\bf 3.} \texttt{GPS-L} and \texttt{GPS-HS} require fewer samples to construct prediction sets, especially on MATH. In a narrow range of $\alpha$ marking the transition between non-zero and zero abstention, CLM requires fewer samples than \texttt{GPS}, but with higher abstention rates and lower non-abstention coverage. The sole exception is \texttt{GPS-L} on TriviaQA for $\alpha \in [0.35, 0.4]$, where the best CLM method achieves marginally higher non-abstention coverage with fewer samples. \\
{\bf 4.} Both \texttt{GPS} variants produce set sizes comparable to those of the best CLM method when abstention rates approach 0 ($\alpha \geq 0.3$). At tighter confidence levels, CLM yields smaller sets but with near-zero non-abstention coverage (e.g., GSM8K and MATH at $\alpha=0.2$), while \texttt{GPS-L} maintains non-abstention coverage $\geq 0.4$. Set size differences diminish as abstention rates approach zero.
% {\bf 1.} Both \texttt{GPS-L} and \texttt{GPS-HS} generally obtain higher non-abstention coverage than all CLM variants in our experiments, especially for $\alpha \leq 0.3$. \texttt{GPS-HS} obtains a higher non-abstention coverage across all $\alpha$ for all datasets except MBPP. We attribute this to the small size of the training set in MBPP (150 examples), which result in a poor quality $\hat{f}$. The second smallest training set out of the ones presented in Figure~\ref{fig:main-results} is GSM8k with 659 examples, where \texttt{GPS HS} uniformly outperforms all other methods across all $\alpha$ levels in terms of abstention rates and non-abstention coverage. \\
% {\bf 2.} \texttt{GPS HS} has an abstention rate of $\leq 1$ and abstention coverage $\geq 0.45$ for GSM8K, MATH, and TriviaQA, even at $\alpha$ levels close to 0.1. This shows that our hidden state admissibility predictor $\hat{f}$ is effective at selectively abstaining on difficult inputs, resulting in valid configurations with non-trivial resulting coverage. It also demonstrates how the quality of the predictor can affect abstention rates within our framework; \texttt{GPS L} follows an identical calibration procedure but with a weaker $\hat{f}$. \\
% {\bf 3.} \texttt{GPS L} and \texttt{GPS HS} are efficient in terms of the number of samples required to construct the prediction set. This is especially clear when considering the MATH dataset, where both methods require less samples than the best CLM method. However, there is a small range of $\alpha$ that marks a transition point between non-zero abstention and zero abstention rates, where CLM tends to require less samples than \texttt{GPS}. However, at these levels, CLM typically has a higher abstention rate, and lower non-abstention coverage. The only exception to this is \texttt{GPS L} on TriviaQA for $\alpha \in [0.35, 0.4]$ where even though the best CLM method can obtain slightly higher non-abstention coverage while requiring a smaller number of samples.\\
% {\bf 4.} Both \texttt{GPS} variants produce set sizes that are extremely close to the best CLM method when abstention rates for all methods fall to near 0 (roughly $\alpha \geq 0.3)$. For tighter confidence levels, CLM can produce smaller sets (when it doesn't abstain), although the non-abstention coverage is too high for these cases. For example, consider GSM8k and MATH at $\alpha=0.2$. CLM is producing smaller sets but it has a non-abstention coverage of near zero, whereas \texttt{GPS L} attains non-abstention coverage of $\geq 0.4$. Generally, as abstention rates converge to zero, so do set sizes between \texttt{GPS} and CLM. 


% \noindent {\bf Set size vs. model quality.} We briefly discuss how the quality of the underlying generative model affects set sizes for \texttt{GPS}. Figures~\ref{fig:scaling-w-pass-rate-gsm} and~\ref{fig:scaling-w-pass-rate-mbpp} show the prediction set sizes produced by \texttt{GPS-HS} across different levels of model performance for MBPP and GSM at different $\alpha$ (additional results are included in the Appendix). Generally, the better the model's pass rate, i.e., number of problems for which the model is able to produce at least one admissible solution, the smaller the size of the generated sets. This demonstrates the ability of \methodname\ to scale with model quality. Note that the slight fluctuations observed at higher coverage levels ($\alpha \in \{0.15,0.2\})$ are due to abstention; \methodname\ tends to abstain at a higher rate for Phi-2 than Llama on both datasets, deflating prediction set sizes at these levels.

\noindent {\bf Set size vs. model quality.} %We examine how the underlying model quality affects \texttt{GPS} set sizes.
Figs.~\ref{fig:scaling-w-pass-rate-gsm} and~\ref{fig:scaling-w-pass-rate-mbpp} show \texttt{GPS-HS} set sizes across model performance levels for GSM8K and MBPP at different $\alpha$ (see the Appendix for more results). Higher model pass rates (problems for which the model produces at least one admissible solution) correlate with smaller generated sets, demonstrating the ability of \methodname\ to scale with model quality. Fluctuations at higher coverage levels ($\alpha \in \{0.15,0.2\}$) stem from abstention: abstention rates are higher for Phi-2 than for Llama on both datasets, which reduces prediction set sizes at these levels.

%\vspace{0.75ex}



% This demonstrates the efficacy of \methodname\ in the setting where the practitioner only has black-box access to sample outputs from the underlying generative model.   