% \documentclass{uai2024} % for initial submission
\documentclass[accepted]{uai2024} % after acceptance, for a revised version; 
% also before submission to see how the non-anonymous paper would look like 
                        
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2024} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2024} % newtx fonts (improves upon
%                                           ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams
\usepackage{multirow}
\usepackage[normalem]{ulem}
\useunder{\uline}{\ul}{}

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{CSS: Contrastive Semantic Similarity for Uncertainty Quantification of LLMs}

% The standard author block has changed for UAI 2024 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<shuang.ao@open.ac.uk>?Subject=Your UAI 2024 paper}{Shuang Ao}{}}
\author[2]{Stefan Rueger}
\author[3]{Advaith Siddharthan}
% \author[3]{Further~Coauthor}
% \author[1]{Further~Coauthor}
% \author[3]{Further~Coauthor}
% \author[3,1]{Further~Coauthor}
% Add affiliations after the authors
\affil{%
    Knowledge Media Institute (KMi)\\
    The Open University\\
    Milton Keynes, UK
}
% \affil[2]{%
%     Second Affiliation\\
%     Address\\
%     …
% }
% \affil[3]{%
%     Another Affiliation\\
%     Address\\
%     …
%   }
  
  \begin{document}
\maketitle

\begin{abstract}

Despite the impressive capability of large language models (LLMs), knowing when to trust their generations remains an open challenge. The recent literature on uncertainty quantification of natural language generation (NLG) utilizes a conventional natural language inference (NLI) classifier to measure the semantic dispersion of LLM responses. These studies employ the logits of the NLI classifier for semantic clustering to estimate uncertainty. However, logits represent the probability of the predicted class and barely contain feature information for potential clustering. Alternatively, CLIP (Contrastive Language–Image Pre-training) performs impressively in extracting image-text pair features and measuring their similarity. To extend its usability, we propose Contrastive Semantic Similarity, a CLIP-based feature extraction module that obtains similarity features for measuring the uncertainty of text pairs. We apply this method to selective NLG, which detects and rejects unreliable generations for better trustworthiness of LLMs. We conduct extensive experiments with three LLMs on several benchmark question-answering datasets with comprehensive evaluation metrics. Results show that our proposed method performs better in estimating reliable responses of LLMs than comparable baselines. The code is available at \url{https://github.com/AoShuang92/css_uq_llms}.





\end{abstract}

\section{Introduction}
\label{sec:intro}

Despite recent breakthroughs in a wide range of natural language generation (NLG) tasks~\citep{hoffmann2022training, touvron2023llama, chowdhery2023palm}, the uncertainty quantification (UQ) of large language models (LLMs) remains an open challenge. Without reliable measures of uncertainty, it is impractical to apply LLMs to critical tasks such as medical~\citep{singhal2023towards} or legal question-answering~\citep{louis2023interpretable}, or medical diagnosis~\citep{wang2023chatcad}. A reliable measure of uncertainty helps to decide when to trust a model, which is also a key problem in building safer AI systems~\citep{hendrycks2021unsolved}. Recently, LLMs have been deployed in industry as powerful tools to assist professional or personal work, with well-known interfaces such as ChatGPT\footnote{\url{https://chat.openai.com/}}, Gemini\footnote{\url{https://gemini.google.com/}} and Perplexity AI\footnote{\url{https://www.perplexity.ai/}}. However, as the capabilities of LLMs grow, so do concerns about their trustworthiness.



The study of UQ in LLMs has gained significant attention recently. Most existing methods are white-box, relying on either calculating entropy from predicted probabilities~\citep{malinin2020uncertainty, kuhn2023semantic} or querying models for their prediction confidence~\citep{lin2022towards, kadavath2022language}. However, these techniques often require task-specific labels, additional training data, or white-box access to the internal model information. Black-box UQ strategies address this by analyzing the consistency of information across model generations. Techniques like n-gram overlap~\citep{fomicheva2020unsupervised} assess surface-level similarity, while more recent approaches explore semantic equivalence~\citep{kuhn2023semantic, lin2023generating}. These methods cluster sentences based on meaning to estimate uncertainty, with a higher number of clusters indicating greater semantic diversity and thus higher LLM uncertainty. However, a key limitation lies in using natural language inference (NLI) classifier logits to measure semantic equivalence. Logits represent class probabilities, not the semantic features needed for accurate clustering. This highlights the need for more sophisticated features that better capture the true semantic relationships between generated texts.



Contrastive Language-Image Pre-training (CLIP)~\citep{radford2021learning} learns the link between textual semantics and their visual representations rather than mapping features to a fixed set of predetermined object categories. In other words, it captures similarity features in a contrastive approach by learning how much a given text snippet relates to an image. Inspired by this capability, we adapt CLIP to contrastively extract similarity features between text pairs, where semantic relations can be represented by feature patterns learned by the model. We propose contrastive semantic similarity (CSS), whose features contain implicit information about the semantic relations of text inputs. Our method allows transitivity between the measurement of semantic equivalence and the inner semantic relations between text pairs. It also provides insightful clustering information to form semantic sets for subsequent uncertainty estimation.



We evaluate our method with selective NLG~\citep{ren2022out, cole2023selectively}, a self-assessment evaluation method to detect when the generations of LLMs are unreliable. Responses with high uncertainty are likely to be wrongly generated, which diminishes the trustworthiness of a model. Therefore, accurate uncertainty estimation leads to higher performance in selective answering. The evaluation is conducted with the area under the accuracy/rejection trade-off curve. In this paper, we conduct an extensive evaluation on several open- and closed-book free-form question answering benchmark datasets, with a sampled set of answers for each question generated by SOTA LLMs. Results show the superiority of our proposed method over the NLI classifier logits. Our contributions and findings are summarized below:


\begin{enumerate}

\item We design a novel technique for UQ in LLMs that utilizes Contrastive Semantic Similarity (CSS) to extract insightful semantic relations between text pairs.

\item We modify the CLIP text encoder to obtain semantic similarities for text-text pairs, then employ spectral clustering to estimate the uncertainty of sampled generations of LLMs.

\item By conducting extensive experiments on LLMs and question-answering datasets, together with extensive ablation studies, we report: 
\begin{enumerate}
    \item our proposed method outperforms SOTA UQ techniques, indicating that the contrastive semantic similarity contains more semantic information than NLI logits;
    \item the contrastive feature extraction of CLIP is superior to that of regular language models, extending its application scope in language generation;

    \item our proposed method enhances selective NLG by detecting unreliable generations more accurately, which reflects the effectiveness of our method for UQ in LLMs.
\end{enumerate}

\end{enumerate}




\section{Related Work}

The study of UQ has attracted great attention in deep learning tasks such as classification or regression~\citep{lakshminarayanan2017simple, kendall2017uncertainties, abdar2021review, ao2023two}. However, most UQ techniques are not transferable to generative AI due to the unique challenges of free-form NLG in terms of (1) entropy calculation over an extremely high-dimensional output distribution, (2) texts with distinct tokens but identical meanings, and (3) the accessibility of token-level probabilities or fine-tuning for end-users. To address the extremely high-dimensional output, \citet{malinin2020uncertainty} utilize the geometric mean of token probabilities to calculate a length-normalized predictive entropy, based on the prior empirical success of \citet{murray2018correcting}. Moreover, a recent study introduces a novel entropy-based uncertainty measure called semantic entropy~\citep{kuhn2023semantic}, incorporating linguistic invariances created by shared meanings.


Word overlap metrics such as METEOR~\citep{banerjee2005meteor}, BLEU~\citep{papineni2002bleu} and ROUGE~\citep{lin2004rouge} are typically used to measure similarities between text pairs. However, distinct tokens may carry similar semantic meanings, and these methods may fail to extract semantic relations between text pairs. To highlight semantic meanings in free-form NLG, semantic equivalence~\citep{kuhn2023semantic} is introduced via the bi-directional entailment algorithm of natural language inference (NLI), which is further utilized to cluster generations of LLMs based on their semantic meanings. Utilizing the concept of entailment to measure the semantic relations between text pairs is logical and understandable from a linguistic perspective: two sentences are semantically equivalent if they entail each other. This method is a breakthrough for clustering text based on semantic meanings instead of traditional n-gram token counting, but it still requires access to predicted probabilities. To measure the uncertainty of LLMs in a post-hoc fashion, the graph Laplacian is employed to cluster LLM generations whose pairwise relations are represented by the NLI classifier~\citep{lin2023generating}.



Selective NLG (also referred to as selective answering/generation, or NLG with rejection) is the main application used to evaluate the effectiveness of UQ methods for language generation. Samples with higher uncertainty are likely to be wrongly predicted or generated, and rejecting them can improve the reliability of the model. It is analogous to the commonly used term selective prediction in classification~\citep{lin2022scrib, geifman2017selective, ao2023empirical}. Both tasks determine when to trust a model, whether it is a classifier or an LLM. Selective answering benefits the decision-making process and improves the trustworthiness of LLMs by detecting their failure outputs.




\section{Methodology}
\label{sec:method}

This section discusses uncertainty quantification methods for LLMs based on measuring the information consistency across $m$ generated responses $\left\{r_1, r_2, \ldots, r_m\right\}$ for a given input question $x$.


\subsection{Background}

\paragraph{NLI Classifier.} A natural language inference (NLI) classifier has been used to measure semantic similarities for text pairs. It assigns each text pair to one of three classes (entailment, neutral, or contradiction), using the pre-trained off-the-shelf DeBERTa model~\citep{he2020deberta}.


\paragraph{Semantic Entropy with NLI Classifier.} Semantic equivalence of text pairs can be measured with NLI classifier logits/scores (referred to as NLI logits for simplicity). The NLI logits for the text pair $r_i, r_j$ are denoted as $s_{r_i, r_j}$. If two sentences entail each other, they share similar semantic meanings. Based on this linguistic concept, a recent study~\citep{kuhn2023semantic} introduces the bi-directional entailment algorithm to measure semantic similarity between text pairs. All generations are clustered into three semantic sets by the predicted label of NLI logits. To obtain the likelihood of a semantic set, the predicted probabilities of the sentences in the cluster are accumulated. Given $m$ sampled responses to a question, larger semantic sets indicate higher information consistency or lower uncertainty, as more sentences carry similar meanings. For an input $x$ with $m$ sampled responses, let $C$ denote the set of semantic clusters and $C_i$ its $i$-th cluster; the semantic entropy ($SE$) estimated by Monte Carlo integration is written as: $SE(x) \approx-|C|^{-1} \sum_{i=1}^{|C|} \log p\left(C_i \mid x\right)$. This method requires access to the predicted probabilities of LLMs. One limitation of this work lies in the over-simplified clustering, as ambiguous responses can belong to more than one class. Furthermore, the equivalence between the clusters judged by NLI logits and the real semantic clusters is not guaranteed~\citep{lin2023generating}.
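
As a small illustration of the formula above (the cluster probabilities are hypothetical and chosen only for this example), suppose the sampled responses to $x$ fall into three semantic clusters with $p(C_1 \mid x)=0.7$, $p(C_2 \mid x)=0.2$ and $p(C_3 \mid x)=0.1$. Using natural logarithms,
\[
SE(x) \approx -\tfrac{1}{3}\left(\log 0.7+\log 0.2+\log 0.1\right) \approx -\tfrac{1}{3}\left(-0.357-1.609-2.303\right) \approx 1.42 .
\]
If instead all responses share one meaning, so that a single cluster has $p(C_1 \mid x)=1$, then $SE(x) \approx -\log 1 = 0$, which matches the reading that consistent generations signal low uncertainty.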


\paragraph{Graph Laplacian with NLI Classifier.} Given the pairwise similarities represented by NLI logits $s_{r_i, r_j}$, but without access to the predicted probabilities of each generation, a straightforward way to cluster the $m$ generations is via spectral clustering. For an input question $x$, let $R=\left\{r_i\right\}_{i=1}^m$ be the generation set, with each generation treated as a node of a graph. Based on the bi-directional entailment algorithm, the semantic relation between $r_i$ and $r_j$ defined by NLI logits is written as $w_{i, j}=\left(s_{i, j}+s_{j, i}\right) / 2$, where $s_{i, j}$ abbreviates $s_{r_i, r_j}$. Hence the symmetric weighted adjacency matrix is $W=\left(w_{i, j}\right)_{i, j=1, \ldots, m}$. The degree matrix $D$ is a diagonal matrix with entries $D_{i i}=\sum_j w_{i, j}$ and $D_{i j}=0$ for $i \neq j$. A node with a higher degree is well connected to the other nodes: a generation with a higher degree carries similar meanings to the other generations, which indicates lower uncertainty of the LLM. The pairwise distance represents the semantic difference between text pairs, and the degree-based uncertainty is written as:

\begin{equation}
\label{eq:degree}
    U_{\text {Deg }}=\operatorname{trace}(m-D) / m^2
\end{equation}
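
As a toy illustration, consider $m=3$ generations in which $r_1$ and $r_2$ are near-paraphrases and $r_3$ differs from both. The weights below are hypothetical, and we assume for this sketch that each weight lies in $[0,1]$ with self-similarity $w_{i,i}=1$, and that the scalar $m$ in Eq.~\eqref{eq:degree} is read as $m$ times the identity matrix:
\[
W=\begin{pmatrix} 1 & 0.9 & 0.1\\ 0.9 & 1 & 0.1\\ 0.1 & 0.1 & 1 \end{pmatrix},
\qquad
D=\operatorname{diag}(2.0,\, 2.0,\, 1.2).
\]
Then $U_{\text{Deg}}=\big((3-2.0)+(3-2.0)+(3-1.2)\big)/3^2=3.8/9 \approx 0.42$. When all three generations are equivalent ($w_{i,j}=1$ for all $i,j$), every degree equals $3$ and $U_{\text{Deg}}=0$.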


The graph Laplacian $L$ is then defined as $L:=D-W$. The eigenvalues of $L$ are non-negative and sorted in ascending order: $\lambda_1 \leq \lambda_2 \leq \ldots \leq \lambda_m$. The corresponding eigenvectors $v_1, v_2, \ldots, v_m$ form an orthogonal basis.

These eigenvalues and eigenvectors are used to measure the uncertainty of the sampled generation set $R$. In spectral clustering, the distribution of eigenvalues is used to determine the number of clusters~\citep{von2007tutorial}. In the context of uncertainty for LLMs, the multiplicity of the zero eigenvalue coincides with the number of semantic sets~\citep{lin2023generating}. Thus the uncertainty estimated by eigenvalue-based semantic clusters ($U_{\text{Eig}}$) can be written as:

\begin{equation}
\label{eq:semset}
    U_{\text{Eig}} =\sum_{k=1}^m \max \left(0,1-\lambda_k\right)   
\end{equation}

Since the multiplicity of the zero eigenvalue of the graph Laplacian coincides with the number of semantic clusters, the eigenvalues close to zero are the informative ones. Following previous work~\citep{lin2023generating, von2007tutorial}, eigenvalues larger than 1 are ignored, as only the smallest few eigenvalues carry important information about the clusters. Hence Eq.~\eqref{eq:semset} takes the maximum of 0 and $1-\lambda_k$, so that eigenvalues larger than 1 contribute nothing to the sum.
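
To make the cluster-counting interpretation concrete, consider a hypothetical idealized case with three generations in which $r_1$ and $r_2$ are equivalent ($w_{1,2}=1$) and $r_3$ is unrelated to both ($w_{1,3}=w_{2,3}=0$). Self-loops are omitted in this sketch, since they cancel in $D-W$. Then
\[
W=\begin{pmatrix} 0 & 1 & 0\\ 1 & 0 & 0\\ 0 & 0 & 0 \end{pmatrix},
\qquad
D=\operatorname{diag}(1,\, 1,\, 0),
\qquad
D-W=\begin{pmatrix} 1 & -1 & 0\\ -1 & 1 & 0\\ 0 & 0 & 0 \end{pmatrix},
\]
whose eigenvalues are $0$, $0$ and $2$. The zero eigenvalue has multiplicity two, one for each of the semantic sets $\{r_1, r_2\}$ and $\{r_3\}$, and Eq.~\eqref{eq:semset} gives $U_{\text{Eig}}=\max(0,1-0)+\max(0,1-0)+\max(0,1-2)=2$.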

The eigenvectors are treated as coordinates for the nodes (sampled generations). An informal embedding $\mathbf{e}_i$ of the generation $r_i$ can be formed as $\mathbf{e}_i=\left[v_{1, i}, \ldots, v_{m, i}\right]$~\citep{ng2001spectral, von2007tutorial}. The average distance of these embeddings to their center is used as an uncertainty measure, named eccentricity ($U_{\text{Ecc}}$), which is written as:

\begin{equation}
\label{eq:ecc}
    U_{\text{Ecc}} =\left\|\left[\mathbf{e}_1^{\prime}{ }^{\top}, \ldots, \mathbf{e}_m^{\prime \top}\right]\right\|_2   
\end{equation}

in which $\mathbf{e}_i^{\prime}$ denotes the offset of $\mathbf{e}_i$ from the average embedding. Eccentricity has been applied to measure uncertainty for LLMs in a black-box way~\citep{lin2023generating}, and also to detect out-of-distribution generations in conditional language models~\citep{ren2022out}. However, utilizing NLI logits to represent semantic similarities is still questionable, as logits are only predicted probabilities. It is necessary to apply features that represent semantic relations for text pairs.
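
As a sketch with hypothetical numbers, suppose three generations have the two-dimensional embeddings $\mathbf{e}_1=\mathbf{e}_2=(1,0)$ and $\mathbf{e}_3=(0,1)$; only two coordinates are used here to keep the arithmetic short. The average embedding is $(2/3,\, 1/3)$, so the offsets are $\mathbf{e}_1^{\prime}=\mathbf{e}_2^{\prime}=(1/3,\, -1/3)$ and $\mathbf{e}_3^{\prime}=(-2/3,\, 2/3)$, and
\[
U_{\text{Ecc}}=\sqrt{2\left(\tfrac{1}{9}+\tfrac{1}{9}\right)+\tfrac{4}{9}+\tfrac{4}{9}}=\sqrt{\tfrac{4}{3}} \approx 1.15 .
\]
If all three embeddings coincide, every offset is zero and $U_{\text{Ecc}}=0$.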

\paragraph{Contrastive Feature Extraction.} CLIP is a contrastive approach to learning the link between textual and visual representations, trained on a large dataset of 400 million image-text pairs~\citep{radford2021learning}. It learns a multi-modal embedding space from a transformer-based image encoder and text encoder, where semantically similar images and texts are also similar in the joint embedding space. As a foundation model trained on a vast amount of data, it has shown great capabilities in tasks such as language-driven image generation~\citep{ramesh2022hierarchical}, zero-shot semantic segmentation~\citep{he2023clip} and text-guided image manipulation~\citep{hou2022feat}.
Hence, utilizing CLIP to learn semantic similarities between text pairs is a plausible approach.


\subsection{UQ with Contrastive Semantic Similarities}

In this section, we propose Contrastive Semantic Similarities (CSS): the CLIP-based semantic similarity features for text pairs. We then utilize CSS in Graph Laplacian (GL) to estimate uncertainty for LLMs. 


\paragraph{Contrastive Semantic Similarities.} Originally, CLIP learns the relation of text-image pairs via jointly trained image and text encoders. By connecting images and texts in the same space, the cosine similarity of the embeddings of correctly matched image-text pairs is maximized, while that of mismatched pairs is minimized. To extract contrastive semantic similarity features, we solely utilize the text encoder to embed both texts of a pair, which avoids the discrepancy of multi-modal embeddings in the joint space. For the text pair $r_i, r_j$, we first utilize the CLIP text encoder to extract features for each of them, then take the element-wise (Hadamard) product of the corresponding embeddings to obtain similarity features, as demonstrated in Figure~\ref{fig:clip}.

As CLIP is based on a contrastive approach, the obtained features represent contrastive relations between text pairs, which we call Contrastive Semantic Similarities (CSS) in our work. CSS feature maps have the same dimension as the embeddings. For better semantic clustering with the graph Laplacian, we then apply principal component analysis (PCA) to reduce the dimension of the CSS feature maps.
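
As a minimal illustration with hypothetical three-dimensional embeddings (actual CLIP text embeddings are much higher-dimensional, and the symbols $\mathbf{u}_i, \mathbf{u}_j$ are notation introduced only for this example), let $\mathbf{u}_i=(0.6,\, 0.8,\, 0)$ and $\mathbf{u}_j=(0.8,\, 0.6,\, 0)$ be the embeddings of $r_i$ and $r_j$. Their Hadamard product is
\[
\mathbf{u}_i \odot \mathbf{u}_j=(0.6 \cdot 0.8,\; 0.8 \cdot 0.6,\; 0 \cdot 0)=(0.48,\, 0.48,\, 0) .
\]
The result keeps one entry per embedding dimension rather than collapsing the pair to a single number. Summing its entries gives the inner product $0.96$, which equals the cosine similarity in this example because both vectors have unit norm.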

\begin{figure}[!h]
\centerline{\includegraphics[width=0.55\textwidth]{figs/flow_clip.pdf}}
\caption{Illustration of how our proposed contrastive semantic similarities are obtained for text pairs. After passing each generation through the CLIP text encoder, we take the element-wise product of the corresponding embeddings to obtain the similarity features.}
\label{fig:clip}
\end{figure}

% CSS is CLIP-based Semantic Similarity Features for Text Pairs

\paragraph{Graph Laplacian with Contrastive Semantic Similarities.} Let $css_{r_i, r_j}$ denote the CSS feature map for the text pair $r_i, r_j$. Similar to estimating uncertainty with NLI logits, the symmetric weighted adjacency matrix for $R$ is $W^{css}$. Each entry $w^{css}_{i, j}$ is a scalar value obtained from the affinity matrix by projecting the similarity vector $css_{r_i, r_j}$. Let $D^{css}$ be the corresponding degree matrix; the uncertainty for $m$ generations can then be written as:

\begin{equation}
\label{eq:degree_sim}
    U_{\text {Deg }}^{css} =\operatorname{trace}(m-D^{css}) / m^2
\end{equation}

The graph Laplacian with CSS features can then be written as: $L^{css}:=D^{css} - W^{css}$. Its eigenvalues in ascending order are $\lambda^{css}_1 \leq \lambda^{css}_2 \leq \ldots \leq \lambda^{css}_m$, and the corresponding eigenvectors are $v^{css}_1, v^{css}_2, \ldots, v^{css}_m$. Recall that the multiplicity of the zero eigenvalue reflects the number of semantic clusters. Following Eq.~\eqref{eq:semset}, the uncertainty $U^{css}_{\text{Eig}}$ is formed as:

\begin{equation}
\label{eq:semset_sim}
    U^{css}_{\text{Eig}} =\sum_{k=1}^m \max \left(0,1-\lambda^{css}_k\right)   
\end{equation}

The embedding $\mathbf{e^{css}}_i$ of generation $r_i$ is now formed from the eigenvectors $v^{css}$ generated by the similarity features. Denoting the offset from the average embedding by $\mathbf{e^{css}}^{\prime}$, the eccentricity $U^{css}_{\text{Ecc}}$ used as uncertainty is written as:

\begin{equation}
\label{eq:ecc_sim}
    U^{css}_{\text{Ecc}} =\left\|\left[\mathbf{e^{css}}_1^{\prime}{ }^{\top}, \ldots, \mathbf{e^{css}}_m^{\prime \top}\right]\right\|_2   
\end{equation}


We apply PCA to reduce the dimension of the CSS feature maps for better clustering, with a feature dimension of 64 in our experiments.


\section{Experiments}
\label{sec:exp}

\subsection{Dataset and Baselines}

We use the open-book conversational question answering dataset CoQA~\citep{reddy2019coqa}, the closed-book question answering dataset TriviaQA~\citep{joshi2017triviaqa}, and the challenging closed-book QA dataset Natural Questions (NQ)~\citep{kwiatkowski2019natural} for our experiments. We utilize the development/validation set of each dataset, containing 7983, 9960 and 3610 samples for CoQA, TriviaQA and NQ, respectively. In terms of LLMs, LLaMA (with 13 billion parameters)~\citep{touvron2023llama}, OPT (with 13 billion parameters)~\citep{zhang2022opt}, and GPT (GPT-3.5-turbo), accessed via the OpenAI API, are used to generate sampled responses for each question. For a fair comparison, we use the official implementations\footnote{\url{https://github.com/lorenzkuhn/semantic_uncertainty}}\textsuperscript{,}\footnote{\url{https://github.com/zlin7/UQ-NLG}} for all the baselines, and we fix the number of sampled generations for each question to $m = 20$.

We compare our proposed method with the following state-of-the-art techniques:

\begin{enumerate}

\item Lexical Similarity (LexiSim)~\citep{lin2004rouge}: measures the average Rouge-L score among sampled generations. 

\item Number of Semantically Distinct Answers (NumSem)~\citep{kuhn2023semantic}: uses the number of semantically distinct responses among the sampled generations as a measure of uncertainty.


\item Semantic Entropy (SE)~\citep{kuhn2023semantic}: computes entropy over clusters
formed by semantically equivalent samples, which requires access to token-level logits/predicted probabilities from LLMs.

\item P(true)~\citep{kadavath2022language}: estimates the probability that a generation is correct by asking the model itself whether the generation is true or false. This method utilizes token-level logits, and we follow the experimental setup detailed in the original study.

\item Graph Laplacian with NLI Classifier Logits (L-GL)~\citep{lin2023generating}: demonstrates that semantic dispersion can effectively estimate the quality of generations of LLMs. NLI logits are used to cluster generations with similar semantic meaning, and the uncertainty is measured by three quantities derived from the GL: eigenvalues (EigV), the degree matrix (Deg), and eigenvector-based eccentricity (Ecc).

\end{enumerate}



\subsection{Implementation Details}

For our experiments, we use the pre-trained CLIP model \textit{openai/clip-vit-base-patch32} from the Hugging Face library, which is trained on a dataset of about 400 million image-text pairs collected from the Internet. Our CSS takes about 2.3 seconds to calculate the UQ for a text pair, whereas previous work~\citep{lin2023generating} takes about 1.2 seconds. Our method therefore incurs a modest additional computational cost while remaining fast. For all our experiments, we use two NVIDIA Tesla P40 GPUs with 23 GB RAM. Generating 20 responses for each question takes about 30--50 seconds.

\subsection{Evaluation Metrics}

Following prior work~\citep{kuhn2023semantic, lin2023generating}, we use the Rouge-L score and the GPT correctness score as matching criteria to evaluate the correctness of generated responses. The GPT correctness score is provided by \textit{gpt-3.5-turbo} from the OpenAI API, which assigns a score between 0 and 1 to the similarity between the given reference answer and a generated response. If the Rouge-L score between the generation and the reference answer is larger than 0.3, the generation is considered correct. Similarly, the threshold for the GPT correctness score is 0.7.

To validate our proposed method in terms of selective answering, we apply the Area Under the Accuracy-Rejection Curve (AUARC)~\citep{nadeem2009accuracy} as the evaluation metric. After applying the baselines and our proposed method, each sample (one question with 20 sampled generations) obtains one score representing its uncertainty. We rank all samples by this score and reject the higher-uncertainty ones, then calculate accuracy on the remaining data. If the UQ method is effective and precise, samples with higher uncertainty are more likely to be wrongly predicted. Thus, the higher the AUARC, the better the quality of the UQ method. To further compare UQ methods, we follow previous works~\citep{kuhn2023semantic, lin2023generating, band2022benchmarking} and employ the Area Under the Receiver Operating Characteristic curve (AUROC). The uncertainty score of each sample serves as the threshold for calculating the sensitivity and specificity for the AUROC. A higher AUROC indicates that the uncertainty score better separates correct from incorrect generations.
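
As a toy illustration of the accuracy-rejection curve (all numbers are hypothetical, and the simple average over rejection levels used below is only one possible discretization of the area), consider four questions with uncertainty scores $0.1$, $0.2$, $0.6$ and $0.9$ whose answers are correct, correct, incorrect and correct, respectively. Rejecting the most uncertain questions first gives the accuracies
\[
\tfrac{3}{4},\quad \tfrac{2}{3},\quad \tfrac{2}{2},\quad \tfrac{1}{1}
\]
at rejection rates of $0\%$, $25\%$, $50\%$ and $75\%$, with an average of about $0.85$. An ideal ranking that rejects the incorrect answer first would instead yield $\tfrac{3}{4}, 1, 1, 1$, with an average of about $0.94$, so the gap between the two reflects how well the uncertainty score orders the errors.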

\section{Results}





\begin{table*}[!ht]
\caption{Results of AUARC with Rouge-L score as the correctness criterion, for generations sampled from LLaMA, OPT and GPT on the TriviaQA, CoQA and NQ datasets. Results of the white-box (WB) methods semantic entropy (SE) and P(true) on GPT generations are not available, since WB methods require the predicted probabilities of outputs, which are not provided by the ChatGPT API. All results are shown in percentages for clarity. Best results are in bold for each dataset and model.}
\label{tab:auarc_Rouge-L}
\centering
\scalebox{.9}{
\begin{tabular}{c|c|ccc|ccc|ccc}
\hline
Dataset                                & \multicolumn{1}{l|}{} & \multicolumn{3}{c|}{TriviaQA}                                     & \multicolumn{3}{c|}{CoQA}                                       & \multicolumn{3}{c}{NQ}                                          \\ \hline
Model                                  & \multicolumn{1}{l|}{} & \multicolumn{1}{c|}{LLaMA} & \multicolumn{1}{c|}{OPT}   & GPT   & \multicolumn{1}{c|}{LLaMA} & \multicolumn{1}{c|}{OPT}   & GPT   & \multicolumn{1}{c|}{LLaMA} & \multicolumn{1}{c|}{OPT}   & GPT   \\ \hline
\multicolumn{1}{l|}{}                  & Acc                   & \multicolumn{1}{c|}{57.57} & \multicolumn{1}{c|}{25.60} & 81.07 & \multicolumn{1}{c|}{55.96} & \multicolumn{1}{c|}{51.99} & 66.38 & \multicolumn{1}{c|}{19.32} & \multicolumn{1}{c|}{9.10}  & 39.83 \\ \hline
\multicolumn{1}{l|}{}                  & Oracle                & \multicolumn{1}{c|}{89.60} & \multicolumn{1}{c|}{54.30} & 97.91 & \multicolumn{1}{c|}{85.10} & \multicolumn{1}{c|}{78.56} & 93.00 & \multicolumn{1}{c|}{42.35} & \multicolumn{1}{c|}{24.15} & 75.58 \\ \hline
\multicolumn{1}{l|}{\multirow{2}{*}{}} & NumSem                & \multicolumn{1}{c|}{73.25} & \multicolumn{1}{c|}{33.76} & 81.07 & \multicolumn{1}{c|}{64.31} & \multicolumn{1}{c|}{57.29} & 67.81 & \multicolumn{1}{c|}{20.85} & \multicolumn{1}{c|}{10.58} & 45.97 \\ \cline{2-11} 
\multicolumn{1}{l|}{}                  & LexiSim               & \multicolumn{1}{c|}{78.98} & \multicolumn{1}{c|}{46.72} & 87.47 & \multicolumn{1}{c|}{79.09} & \multicolumn{1}{c|}{73.15} & 80.39 & \multicolumn{1}{c|}{35.74} & \multicolumn{1}{c|}{17.77} & 58.01 \\ \hline
\multirow{3}{*}{L-GL}                    & EigV                  & \multicolumn{1}{c|}{80.67} & \multicolumn{1}{c|}{48.70} & 92.32 & \multicolumn{1}{c|}{79.54} & \multicolumn{1}{c|}{71.96} & 84.34 & \multicolumn{1}{c|}{33.58} & \multicolumn{1}{c|}{14.72} & 62.13 \\ \cline{2-11} 
                                       & Ecc                   & \multicolumn{1}{c|}{80.20} & \multicolumn{1}{c|}{48.83} & 92.01 & \multicolumn{1}{c|}{78.92} & \multicolumn{1}{c|}{70.96} & 83.94 & \multicolumn{1}{c|}{34.26} & \multicolumn{1}{c|}{17.41} & 61.80 \\ \cline{2-11} 
                                       & Deg                   & \multicolumn{1}{c|}{80.71} & \multicolumn{1}{c|}{49.00} & 92.24 & \multicolumn{1}{c|}{79.12} & \multicolumn{1}{c|}{71.83} & 84.22 & \multicolumn{1}{c|}{34.23} & \multicolumn{1}{c|}{17.49} & 62.42 \\ \hline
\multirow{2}{*}{WB}                    & SE                    & \multicolumn{1}{c|}{74.09} & \multicolumn{1}{c|}{47.90} & –     & \multicolumn{1}{c|}{77.65} & \multicolumn{1}{c|}{67.46} & –     & \multicolumn{1}{c|}{28.97} & \multicolumn{1}{c|}{16.62} & –     \\ \cline{2-11} 
                                       & P(true)               & \multicolumn{1}{c|}{61.85} & \multicolumn{1}{c|}{20.93} & –     & \multicolumn{1}{c|}{61.75} & \multicolumn{1}{c|}{58.32} & –     & \multicolumn{1}{c|}{20.19} & \multicolumn{1}{c|}{8.27}  & –     \\ \hline
\multirow{3}{*}{Ours (CSS)}                  & CSS-EigV              & \multicolumn{1}{c|}{81.47} & \multicolumn{1}{c|}{49.85} & 92.70 & \multicolumn{1}{c|}{\textbf{81.92}} & \multicolumn{1}{c|}{72.13} & 87.26 & \multicolumn{1}{c|}{\textbf{36.80}} & \multicolumn{1}{c|}{18.10} & 64.83 \\ \cline{2-11} 
                                       & CSS-Ecc               & \multicolumn{1}{c|}{81.29} & \multicolumn{1}{c|}{49.60} & 93.07 & \multicolumn{1}{c|}{80.83} & \multicolumn{1}{c|}{71.36} & \textbf{87.34} & \multicolumn{1}{c|}{36.62} & \multicolumn{1}{c|}{18.19} & \textbf{65.04} \\ \cline{2-11} 
                                       & CSS-Deg               & \multicolumn{1}{c|}{\textbf{81.55}} & \multicolumn{1}{c|}{\textbf{50.08}} & \textbf{93.18} & \multicolumn{1}{c|}{81.17} & \multicolumn{1}{c|}{\textbf{73.18}} & 87.02 & \multicolumn{1}{c|}{36.67} & \multicolumn{1}{c|}{\textbf{18.34}} & 64.87 \\ \hline
\end{tabular}}
\end{table*}



% --------------------------------------------------------------------------------------


\begin{table*}[!h]
\caption{AUROC results with the Rouge-L score as the correctness criterion, on sampled generations from LLaMA, OPT, and GPT on the TriviaQA, CoQA, and NQ datasets. All results are shown as percentages. The best result for each dataset and model is in bold.}
\label{tab:auroc_Rouge-L}
\centering
\scalebox{.9}{
\begin{tabular}{c|c|ccc|ccc|ccc}
\hline
Dataset                                & \multicolumn{1}{l|}{} & \multicolumn{3}{c|}{TriviaQA}                                     & \multicolumn{3}{c|}{CoQA}                                       & \multicolumn{3}{c}{NQ}                                          \\ \hline
Model                                  & \multicolumn{1}{l|}{} & \multicolumn{1}{c|}{LLaMA} & \multicolumn{1}{c|}{OPT}   & GPT   & \multicolumn{1}{c|}{LLaMA} & \multicolumn{1}{c|}{OPT}   & GPT   & \multicolumn{1}{c|}{LLaMA} & \multicolumn{1}{c|}{OPT}   & GPT   \\ \hline
\multicolumn{1}{l|}{\multirow{2}{*}{}} & NumSem                & \multicolumn{1}{c|}{75.06} & \multicolumn{1}{c|}{68.56} & 68.20 & \multicolumn{1}{c|}{57.76} & \multicolumn{1}{c|}{57.60} & 51.69 & \multicolumn{1}{c|}{55.59} & \multicolumn{1}{c|}{59.20} & 61.13 \\ \cline{2-11} 
\multicolumn{1}{l|}{}                  & LexiSim               & \multicolumn{1}{c|}{77.63} & \multicolumn{1}{c|}{76.48} & 81.13 & \multicolumn{1}{c|}{75.72} & \multicolumn{1}{c|}{76.40} & 68.70 & \multicolumn{1}{c|}{76.72} & \multicolumn{1}{c|}{73.90} & 71.65 \\ \hline
\multirow{3}{*}{L-GL}                    & EigV                  & \multicolumn{1}{c|}{84.35} & \multicolumn{1}{c|}{82.88} & \textbf{83.40} & \multicolumn{1}{c|}{77.95} & \multicolumn{1}{c|}{75.70} & 78.65 & \multicolumn{1}{c|}{72.59} & \multicolumn{1}{c|}{73.88} & 80.88 \\ \cline{2-11} 
                                       & Ecc                   & \multicolumn{1}{c|}{83.66} & \multicolumn{1}{c|}{83.91} & 82.50 & \multicolumn{1}{c|}{77.26} & \multicolumn{1}{c|}{74.81} & 77.39 & \multicolumn{1}{c|}{74.44} & \multicolumn{1}{c|}{76.02} & 79.82 \\ \cline{2-11} 
                                       & Deg                   & \multicolumn{1}{c|}{84.52} & \multicolumn{1}{c|}{83.36} & 82.93 & \multicolumn{1}{c|}{77.53} & \multicolumn{1}{c|}{75.85} & 78.76 & \multicolumn{1}{c|}{74.01} & \multicolumn{1}{c|}{74.75} & \textbf{81.31} \\ \hline
\multirow{2}{*}{WB}                    & SE                    & \multicolumn{1}{c|}{74.39} & \multicolumn{1}{c|}{81.54} & -- & \multicolumn{1}{c|}{74.55} & \multicolumn{1}{c|}{71.25} & -- & \multicolumn{1}{c|}{69.50} & \multicolumn{1}{c|}{74.61} & -- \\ \cline{2-11} 
                                       & P(true)               & \multicolumn{1}{c|}{55.12} & \multicolumn{1}{c|}{41.64} & -- & \multicolumn{1}{c|}{55.14} & \multicolumn{1}{c|}{52.67} & -- & \multicolumn{1}{c|}{52.52} & \multicolumn{1}{c|}{47.92} & -- \\ \hline
\multirow{3}{*}{Ours (CSS)}                                  & CSS-EigV              & \multicolumn{1}{c|}{85.52} & \multicolumn{1}{c|}{85.37} & 82.27 & \multicolumn{1}{c|}{\textbf{78.78}} & \multicolumn{1}{c|}{\textbf{77.19}} & 80.04 & \multicolumn{1}{c|}{\textbf{76.08}} & \multicolumn{1}{c|}{\textbf{77.08}} & 79.28 \\ \cline{2-11} 
\multicolumn{1}{l|}{}                  & CSS-Ecc               & \multicolumn{1}{c|}{85.17} & \multicolumn{1}{c|}{84.97} & 81.57 & \multicolumn{1}{c|}{78.40} & \multicolumn{1}{c|}{76.70} & \textbf{80.40} & \multicolumn{1}{c|}{75.76} & \multicolumn{1}{c|}{76.53} & 79.91 \\ \cline{2-11} 
\multicolumn{1}{l|}{}                  & CSS-Deg               & \multicolumn{1}{c|}{\textbf{85.63}} & \multicolumn{1}{c|}{\textbf{85.82}} & 81.77 & \multicolumn{1}{c|}{78.68} & \multicolumn{1}{c|}{76.95} & 79.12 & \multicolumn{1}{c|}{75.81} & \multicolumn{1}{c|}{77.25} & 80.01 \\ \hline
\end{tabular}}
\end{table*}


\begin{table}[!h]
\caption{AUARC results with the GPT score as the correctness criterion, on sampled generations from LLaMA and OPT on the TriviaQA, CoQA, and NQ datasets. Acc is the base accuracy, NumSem is the number of semantically distinct answers, and LexiSim is lexical similarity. L-GL is the graph Laplacian with NLI classifier logits, including the EigV, Ecc, and Deg sub-methods. WB denotes the white-box methods, semantic entropy (SE) and p(true), which require access to token-level logits. Our proposed method includes three sub-methods, CSS-EigV, CSS-Ecc, and CSS-Deg, where CSS stands for contrastive semantic similarity. All results are shown as percentages. The best result for each dataset and model is in bold.}
\label{tab: auarc_gpt}
% \centering
\scalebox{.71}{
\begin{tabular}{c|c|cc|cc|cc}
\hline
                                & \multicolumn{1}{l|}{} & \multicolumn{2}{c|}{TriviaQA}        & \multicolumn{2}{c|}{CoQA}          & \multicolumn{2}{c}{NQ}             \\ \hline
                                  & \multicolumn{1}{l|}{} & \multicolumn{1}{c|}{LLaMA} & OPT   & \multicolumn{1}{c|}{LLaMA} & OPT   & \multicolumn{1}{l|}{LLaMA} & OPT   \\ \hline
\multicolumn{1}{l|}{}                  & Acc                   & \multicolumn{1}{c|}{61.18} & 25.75 & \multicolumn{1}{c|}{62.46} & 51.81 & \multicolumn{1}{c|}{23.63} & 8.60  \\ \hline
\multicolumn{1}{l|}{}                  & Oracle                & \multicolumn{1}{c|}{87.03} & 54.72 & \multicolumn{1}{c|}{86.29} & 79.41 & \multicolumn{1}{c|}{47.67} & 23.28 \\ \hline
\multicolumn{1}{l|}{\multirow{2}{*}{}} & NumSem                & \multicolumn{1}{c|}{78.78} & 39.46 & \multicolumn{1}{c|}{67.58} & 60.41 & \multicolumn{1}{c|}{28.18} & 10.36 \\ \cline{2-8} 
\multicolumn{1}{l|}{}                  & LexiSim               & \multicolumn{1}{c|}{80.32} & 45.68 & \multicolumn{1}{c|}{78.17} & 71.46 & \multicolumn{1}{c|}{40.15} & 15.92 \\ \hline
\multirow{3}{*}{L-GL}                    & EigV                  & \multicolumn{1}{c|}{83.52} & 50.54 & \multicolumn{1}{c|}{80.21} & 72.46 & \multicolumn{1}{c|}{40.02} & 17.20 \\ \cline{2-8} 
                                       & Ecc                   & \multicolumn{1}{c|}{83.64} & 50.42 & \multicolumn{1}{c|}{80.14} & 71.73 & \multicolumn{1}{c|}{40.16} & 17.82 \\ \cline{2-8} 
                                       & Deg                   & \multicolumn{1}{c|}{84.61} & 51.06 & \multicolumn{1}{c|}{79.34} & 72.51 & \multicolumn{1}{c|}{40.81} & 17.43 \\ \hline
\multirow{2}{*}{WB}                    & SE                    & \multicolumn{1}{c|}{79.15} & 51.11 & \multicolumn{1}{c|}{78.83} & 70.75 & \multicolumn{1}{c|}{36.03} & 17.40 \\ \cline{2-8} 
                                       & P(true)               & \multicolumn{1}{c|}{64.98} & 20.25 & \multicolumn{1}{c|}{64.04} & 50.23 & \multicolumn{1}{c|}{24.72} & 7.63  \\ \hline
\multirow{3}{*}{Ours}                  & CSS-EigV              & \multicolumn{1}{c|}{84.76} & 50.16 & \multicolumn{1}{c|}{81.21} & 73.67 & \multicolumn{1}{c|}{41.15} & 18.20 \\ \cline{2-8} 
                                       & CSS-Ecc               & \multicolumn{1}{c|}{84.95} & 51.24 & \multicolumn{1}{c|}{\textbf{82.66}} & \textbf{73.38} & \multicolumn{1}{c|}{\textbf{42.39}} & \textbf{18.65} \\ \cline{2-8} 
                                       & CSS-Deg               & \multicolumn{1}{c|}{\textbf{86.03}} & \textbf{52.35} & \multicolumn{1}{c|}{81.28} & 72.96 & \multicolumn{1}{c|}{41.76} & 18.59 \\ \hline
\end{tabular}}
\end{table}


Table~\ref{tab:auarc_Rouge-L} presents the AUARC results of sampled generations from the LLMs on the TriviaQA, CoQA, and NQ datasets, with the Rouge-L score serving as the correctness criterion. The white-box (WB) methods, semantic entropy (SE) and p(true), depend on token-level probabilities. As the ChatGPT API does not provide these, we are unable to report the corresponding AUARC and AUROC results for GPT.



When the model is perfectly calibrated, all rejected samples will be the wrong ones. In the table, Oracle represents this upper bound on AUARC performance.
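
To make the role of Oracle concrete, the accuracy-rejection curve and its area can be sketched as follows. The symbols $N$, $c_i$, $u_i$, $\pi$, and $A(k)$ are introduced only for this illustration, and the discretization shown here is one common choice rather than a statement of the exact evaluation code. Let $c_i \in [0,1]$ be the correctness score of sample $i$, let $u_i$ be its uncertainty score, and let $\pi$ be a permutation that sorts the $N$ samples by increasing uncertainty. After rejecting the $k$ most uncertain samples, the accuracy on the remaining $N-k$ samples is
\begin{equation*}
A(k) = \frac{1}{N-k} \sum_{i=1}^{N-k} c_{\pi(i)}, \qquad k = 0, \dots, N-1,
\end{equation*}
and the AUARC can be approximated by averaging over all rejection levels,
\begin{equation*}
\mathrm{AUARC} \approx \frac{1}{N} \sum_{k=0}^{N-1} A(k).
\end{equation*}
Under this reading, $A(0)$ is the base accuracy (Acc), and Oracle corresponds to choosing $\pi$ so that the samples are sorted by decreasing $c_i$, which maximizes every $A(k)$ simultaneously.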

As the only rule-based measure (utilizing Rouge-L) among all methods, lexical similarity estimates uncertainty better than the Number of Semantically Distinct Answers (NumSem) in most cases. This suggests that variations in vocabulary or grammar contribute significantly to semantic meaning. Among the NLI logits-based methods, all three sub-methods of the NLI logits-based graph Laplacian (EigV, Ecc, and Deg) perform better than semantic entropy on the LLaMA-generated responses.

The performance of our proposed CSS graph Laplacian is, on average, 1.5\% to 2\% higher than that of L-GL. CSS-Deg achieves better performance than CSS-EigV and CSS-Ecc in most cases.

The ARC depicted in Figure~\ref{fig:coca_opt} illustrates how the eccentricity of our method (Ecc (ours)) outperforms the other baselines on OPT-sampled generations for the CoQA dataset. As the rejection rate increases, our method maintains higher accuracy than the other approaches, indicating improved uncertainty estimation through our contrastive technique. The AUROC results in Table~\ref{tab:auroc_Rouge-L} are mostly consistent with the AUARC results, with our proposed CSS-EigV obtaining the highest performance in most cases.
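
For reference, AUROC admits a simple pairwise reading. As a sketch, assume binary correctness labels $y_i \in \{0,1\}$ (for example, $y_i = 1$ when the Rouge-L score exceeds the chosen threshold) and uncertainty scores $u_i$; both symbols are introduced only for this illustration. Then
\begin{align*}
\mathrm{AUROC} ={} & \Pr\left(u_i > u_j \mid y_i = 0,\ y_j = 1\right) \\
& + \tfrac{1}{2} \Pr\left(u_i = u_j \mid y_i = 0,\ y_j = 1\right),
\end{align*}
that is, the probability that a randomly chosen incorrect generation receives a higher uncertainty than a randomly chosen correct one, with ties counted as one half. A value of 50\% therefore corresponds to a random ranking.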

Table~\ref{tab: auarc_gpt} presents the AUARC results of sampled generations from the LLMs on the TriviaQA, CoQA, and NQ datasets, with the GPT score as the correctness criterion. Responses generated by GPT are omitted for all datasets because of their exceptionally high accuracy (over 95\%), which would make a comparison with the generations from LLaMA and OPT unfair. Our method outperforms the other baselines, with CSS-Ecc showing larger improvements in estimating the uncertainty of LLM generations.


In summary, the graph Laplacian methods (L-GL and ours) outperform the white-box methods of semantic entropy and p(true), demonstrating the effectiveness of spectral clustering in analyzing semantic relations. Our proposed method exhibits superior uncertainty estimation compared to L-GL, indicating more effective extraction of semantic relations through the contrastive method.


% \begin{table}[!h]
% \caption{Results of AUROC with GPT score as the correctness criterion, on sampled generation by LLaMa and OPT on dataset of TriviaQA, CoQA and NQ.  All results are shown in percentages for clarity. Best results are in bold for each dataset.}
% \label{tab:auroc_gpt}
% \centering
% \scalebox{.7}{
% \begin{tabular}{c|c|cc|cc|cc}
% \hline
% Dataset                                & \multicolumn{1}{l|}{} & \multicolumn{2}{c|}{TriviaQA}        & \multicolumn{2}{c|}{CoQA}          & \multicolumn{2}{c}{NQ}             \\ \hline
% Model                                  & \multicolumn{1}{l|}{} & \multicolumn{1}{c|}{LLaMA} & OPT   & \multicolumn{1}{c|}{LLaMA} & OPT   & \multicolumn{1}{l|}{LLaMA} & OPT   \\ \hline
% \multicolumn{1}{l|}{\multirow{2}{*}{}} & NumSem                & \multicolumn{1}{c|}{78.79} & 75.02 & \multicolumn{1}{c|}{59.34} & 59.72 & \multicolumn{1}{c|}{60.61} & 59.45 \\ \cline{2-8} 
% \multicolumn{1}{l|}{}                  & LexiSim               & \multicolumn{1}{c|}{75.92} & 74.49 & \multicolumn{1}{c|}{71.09} & 71.81 & \multicolumn{1}{c|}{75.19} & 69.66 \\ \hline
% \multirow{3}{*}{L-GL}                    & EigV                  & \multicolumn{1}{c|}{84.90} & 86.09 & \multicolumn{1}{c|}{77.61} & 75.61 & \multicolumn{1}{c|}{75.05} & 76.62 \\ \cline{2-8} 
%                                        & Ecc                   & \multicolumn{1}{c|}{\textbf{86.43}} & \textbf{86.86} & \multicolumn{1}{c|}{76.32} & 74.25 & \multicolumn{1}{c|}{74.66} & 79.37 \\ \cline{2-8} 
%                                        & Deg                   & \multicolumn{1}{c|}{84.21} & 86.60 & \multicolumn{1}{c|}{\textbf{77.35}} & \textbf{76.06} & \multicolumn{1}{c|}{74.79} & 77.44 \\ \hline
% \multirow{2}{*}{WB}                    & SE                    & \multicolumn{1}{c|}{76.61} & 84.38 & \multicolumn{1}{c|}{73.32} & 72.71 & \multicolumn{1}{c|}{69.17} & 77.16 \\ \cline{2-8} 
%                                        & P(true)               & \multicolumn{1}{c|}{59.21} & 41.76 & \multicolumn{1}{c|}{55.39} & 45.92 & \multicolumn{1}{c|}{50.35} & 44.74 \\ \hline
% \multirow{3}{*}{Ours}                  & CSS-EigV              & \multicolumn{1}{c|}{84.63} & 84.70 & \multicolumn{1}{c|}{74.45} & 73.31 & \multicolumn{1}{c|}{74.77} & 77.04 \\ \cline{2-8} 
%                                        & CSS-Ecc               & \multicolumn{1}{c|}{83.88} & 85.14 & \multicolumn{1}{c|}{75.32} & 73.45 & \multicolumn{1}{c|}{73.83} & 77.74 \\ \cline{2-8} 
%                                        & CSS-Deg               & \multicolumn{1}{c|}{85.20} & 85.85 & \multicolumn{1}{c|}{76.25} & 74.40 & \multicolumn{1}{c|}{\textbf{75.63}} &\textbf{ 79.59} \\ \hline
% \end{tabular}}
% \end{table}




% --------------------------------------------------------------------------------------


\begin{figure}[!h]
\centerline{\includegraphics[width=0.55\textwidth]{figs/coqa_opt_auarc.pdf}}
\caption{The accuracy-rejection curve for OPT sampled generations on CoQA, with Rouge-L $> 0.3$ as the correctness criterion to obtain the base accuracy. After ranking samples by the uncertainty scores obtained from the listed methods, we reject the samples with the highest uncertainty and calculate the accuracy on the remaining data. Oracle represents the highest achievable performance, where the model is perfectly calibrated and all rejected samples are wrongly predicted. We compare the Eccentricity of our method (Ecc (ours)) with other baselines, namely p(true), semantic entropy (SE), Eccentricity in L-GL (Ecc (L-GL)), and the number of semantically distinct answers (NumSem).}
\label{fig:coca_opt}
\end{figure}

% --------------------------------------------------------------------------------------



% --------------------------------------------------------------------------------------
\section{Ablation Study}


We conducted an extensive ablation study alongside our main experiments to evaluate the necessity of applying dimension reduction to Contrastive Semantic Similarity (CSS) feature maps. We tested CSS feature maps of various dimensions on sampled generations from LLaMA, OPT, and GPT on the CoQA dataset, employing the Eccentricity metric of our proposed method. We utilized the original CSS features with a dimension of 512, and reduced dimensions of 128 and 64 using PCA and UMAP~\citep{mcinnes2018umap} techniques. The results, as shown in Table~\ref{tab:pca_umap}, indicate that a certain level of dimension reduction can enhance AUARC and AUROC results, suggesting benefits for improved clustering. UMAP's results were slightly inferior to those of PCA; therefore, we used the reduced dimension of 64 by PCA for our main experiments.
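
A standard formulation of the PCA step is sketched below. The symbols $X$, $\mu$, $W_d$, and $Z$ are introduced only for this illustration, and implementation details such as the set of samples on which the projection is fitted are not implied by it. Let $X$ be the $n \times 512$ matrix that stacks the $n$ CSS feature vectors row-wise, let $\mu$ be the 512-dimensional mean of these rows, and let $W_d$ be the $512 \times d$ matrix whose columns are the $d$ leading eigenvectors of the sample covariance of $X$. The reduced features are then
\begin{equation*}
Z = \left(X - \mathbf{1}_n \mu^{\top}\right) W_d, \qquad d \in \{128, 64\},
\end{equation*}
where $\mathbf{1}_n$ is the all-ones column vector of length $n$ and $Z$ has size $n \times d$. UMAP replaces this linear projection with a nonlinear embedding of the same target dimension.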


% --------------------------------------------------------------------------------------
\begin{table}[t]
\caption{Effect of feature dimension reduction on the Eccentricity metric of the proposed method (CSS-Ecc) for sampled generations from LLaMA, OPT, and GPT on the CoQA dataset. The original feature dimension is 512, which is then reduced to 128 and 64 via PCA and UMAP. Results for the setting used in our proposed method are underlined.}
\label{tab:pca_umap}
\scalebox{.75}{
\begin{tabular}{c|c|ccc|ccc}
\hline
\multicolumn{1}{l|}{} & \multicolumn{1}{l|}{} & \multicolumn{3}{c|}{AUARC}                                                        & \multicolumn{3}{c}{AUROC}                                                         \\ \hline
\multicolumn{1}{l|}{} & Features              & \multicolumn{1}{c|}{LLaMA}       & \multicolumn{1}{c|}{OPT}         & GPT         & \multicolumn{1}{c|}{LLaMA}       & \multicolumn{1}{c|}{OPT}         & GPT         \\ \hline
Original              & 512                   & \multicolumn{1}{c|}{80.17}       & \multicolumn{1}{c|}{71.04}       & 84.23       & \multicolumn{1}{c|}{77.28}       & \multicolumn{1}{c|}{74.92}       & 77.45       \\ \hline
\multirow{2}{*}{PCA}  & 128                   & \multicolumn{1}{c|}{80.54}       & \multicolumn{1}{c|}{71.12}       & 86.89       & \multicolumn{1}{c|}{78.25}       & \multicolumn{1}{c|}{76.75}       & 78.52       \\ \cline{2-8} 
                      & 64                    & \multicolumn{1}{c|}{{\ul 80.83}} & \multicolumn{1}{c|}{{\ul 71.36}} & {\ul 87.34} & \multicolumn{1}{c|}{{\ul 78.40}} & \multicolumn{1}{c|}{{\ul 76.70}} & {\ul 80.04} \\ \hline
\multirow{2}{*}{UMAP} & 128                   & \multicolumn{1}{c|}{79.95}       & \multicolumn{1}{c|}{71.09}       & 84.22       & \multicolumn{1}{c|}{78.15}       & \multicolumn{1}{c|}{75.21}       & 77.85       \\ \cline{2-8} 
                      & 64                    & \multicolumn{1}{c|}{80.52}       & \multicolumn{1}{c|}{71.16}       & 85.64       & \multicolumn{1}{c|}{78.46}       & \multicolumn{1}{c|}{75.67}       & 79.58       \\ \hline
\end{tabular}}
\end{table}

% --------------------------------------------------------------------------------------
\begin{table}[]
\caption{Comparison of graph Laplacians based on NLI logits and on NLI feature maps for sampled generations from LLaMA and OPT on the TriviaQA dataset. The AUARC and AUROC results use the GPT score as the correctness criterion. `L-GL' denotes the graph Laplacian based on NLI logits, and `F-GL' denotes the graph Laplacian based on NLI feature maps. The results are compared across three sub-methods of uncertainty quantification (UQ): Eigenvalue (EigV), Eccentricity (Ecc), and Degree Metric (Deg).}
\label{tab:deberta_feature}
\scalebox{.9}{
\begin{tabular}{c|c|cc|cc}
\hline
\multicolumn{1}{l|}{} & \multicolumn{1}{l|}{} & \multicolumn{2}{c|}{AUARC}         & \multicolumn{2}{c}{AUROC}          \\ \hline
\multicolumn{1}{l|}{} & \multicolumn{1}{l|}{} & \multicolumn{1}{c|}{LLaMA} & OPT   & \multicolumn{1}{c|}{LLaMA} & OPT   \\ \hline
\multirow{3}{*}{L-GL} & EigV                  & \multicolumn{1}{c|}{83.52} & 50.54 & \multicolumn{1}{c|}{84.90} & 86.09 \\ \cline{2-6} 
                      & Ecc                   & \multicolumn{1}{c|}{83.64} & 50.42 & \multicolumn{1}{c|}{86.43} & 86.86 \\ \cline{2-6} 
                      & Deg                   & \multicolumn{1}{c|}{84.61} & 51.06 & \multicolumn{1}{c|}{84.21} & 86.60 \\ \hline
\multirow{3}{*}{F-GL} & EigV                  & \multicolumn{1}{c|}{83.54} & 50.48 & \multicolumn{1}{c|}{84.95} & 85.92 \\ \cline{2-6} 
                      & Ecc                   & \multicolumn{1}{c|}{83.62} & 51.62 & \multicolumn{1}{c|}{86.53} & 86.95 \\ \cline{2-6} 
                      & Deg                   & \multicolumn{1}{c|}{84.65} & 51.36 & \multicolumn{1}{c|}{84.16} & 87.12 \\ \hline
\end{tabular}}
\end{table}



Moreover, we argue that NLI classifier logits lack substantial semantic clustering information, as they represent predicted probabilities. To verify this claim, we compared the NLI logits-based graph Laplacian (L-GL) with a graph Laplacian built on NLI feature maps extracted from the off-the-shelf DeBERTa model~\citep{he2020deberta} (F-GL), on sampled generations from LLaMA and OPT on the TriviaQA dataset. The results for EigV, Ecc, and Deg, presented in Table~\ref{tab:deberta_feature}, show that the overall performance of F-GL is marginally better than that of L-GL, indicating that feature maps contain more clustering information than mere probabilities.

\begin{table}[!ht]
\caption{Comparison of the CLIP text encoder with the language models BERT, DeBERTa, and Sentence-BERT for feature embedding on LLaMA sampled generations for the TriviaQA dataset. The AUARC and AUROC results use the GPT score as the correctness criterion and Eccentricity (CSS-Ecc) as the uncertainty measure. All feature embeddings are used without feature reduction, and the best result is in bold.}
\label{tab:compare_LM}
\centering
\begin{tabular}{l|l|l}
\hline
Model         & AUARC & AUROC \\ \hline
BERT          & 83.78 & 86.24 \\ \hline
DeBERTa       & 83.62 & 86.53 \\ \hline
Sentence-BERT & 83.72 & 87.02 \\ \hline
CLIP          & \textbf{84.32} & \textbf{87.19} \\ \hline
\end{tabular}
\end{table}

CLIP is trained with a contrastive objective on a dataset of image-caption pairs, so its text encoder is trained specifically on image captions. Although both are textual data, the domain of image captions can differ significantly from the NLP corpora on which language models are trained. As a result, employing such a caption-focused text embedding to evaluate text generated by LLMs may raise concerns. We therefore conducted experiments comparing the CLIP text encoder with the regular language models BERT~\citep{devlin2018bert}, DeBERTa~\citep{he2020deberta}, and Sentence-BERT~\citep{reimers2019sentence}. We used each model for feature embedding on LLaMA sampled generations for the TriviaQA dataset, evaluated with the GPT correctness score. Table~\ref{tab:compare_LM} shows that CLIP outperforms the other language models, suggesting that CLIP yields more accurate text-text similarity assessments.


\begin{table}[t]
\caption{Comparison of Rouge-L and METEOR as correctness criteria for responses generated by LLaMA on the TriviaQA dataset. The AUARC and AUROC results are reported for the Eccentricity (Ecc) measure in L-GL and in our method.}
\label{tab:meteor}
\begin{tabular}{l|l|l|l}
\hline
Evaluation   Metric      & Method & AUARC & AUROC \\ \hline
\multirow{2}{*}{Rouge-L} & L-GL   & 80.20 & 83.66 \\ \cline{2-4} 
                         & Ours   & 81.29 & 85.17 \\ \hline
\multirow{2}{*}{METEOR}  & L-GL   & 80.32 & 83.79 \\ \cline{2-4} 
                         & Ours   & 81.35 & 85.22 \\ \hline
\end{tabular}
\end{table}



Following previous work~\citep{kuhn2023semantic,lin2023generating}, we utilized Rouge-L as the correctness measure for sampled generations of LLMs to ensure a fair comparison. However, as an n-gram-based metric, Rouge-L may fail to evaluate lexically different but semantically similar sentences, whereas LLM-generated text is more semantically driven. To address this limitation, we employed METEOR, a metric that incorporates more semantic features than simple lexical overlap, as an additional correctness criterion for the generated responses of LLMs. Interestingly, the results in Table~\ref{tab:meteor} show that METEOR follows a similar trend to Rouge-L in both AUROC and AUARC.
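
The similarity of the two trends can be read directly from Table~\ref{tab:meteor} by taking the gain of our method over L-GL under each correctness criterion (the symbol $\Delta$ is introduced only for this illustration):
\begin{align*}
\Delta^{\text{AUARC}}_{\text{Rouge-L}} &= 81.29 - 80.20 = 1.09, \\
\Delta^{\text{AUARC}}_{\text{METEOR}} &= 81.35 - 80.32 = 1.03, \\
\Delta^{\text{AUROC}}_{\text{Rouge-L}} &= 85.17 - 83.66 = 1.51, \\
\Delta^{\text{AUROC}}_{\text{METEOR}} &= 85.22 - 83.79 = 1.43.
\end{align*}
For both AUARC and AUROC, the gains under the two criteria differ by less than 0.1 percentage points.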



\section{Discussion}


The empirical evidence in Tables~\ref{tab:auarc_Rouge-L},~\ref{tab:auroc_Rouge-L}, and~\ref{tab: auarc_gpt}, obtained from extensive experiments across multiple datasets and evaluation metrics, shows that our proposed contrastive semantic similarity outperforms existing methods in most settings, including the most recent NLI logits-based approaches. Our proposed CLIP-based semantic similarity learns contrastive features between text pairs, yielding better semantic clustering than the baselines.


Our ablation study further supports our simple yet effective approach. The study presented in Table~\ref{tab:pca_umap} demonstrates the benefits of reducing the dimension of the feature maps with PCA and UMAP for improved clustering. Additionally, NLI logits represent the predicted probabilities for the labels ``entailment,'' ``contradiction,'' and ``neutral.'' However, they lack comprehensive latent semantic features, as logits are primarily trained to identify labels. Semantic relationships between text pairs can be implicit, making a feature map a better semantic representation than NLI logits. Our ablation study, shown in Table~\ref{tab:deberta_feature}, compares NLI logits and the NLI feature map from DeBERTa, revealing that feature maps contain more potential semantic clustering information than logits.

Furthermore, we investigate the effectiveness of text feature extraction with CLIP compared with regular language models, with results shown in Table~\ref{tab:compare_LM}. CLIP's ability to extract contrastive features from input pairs has made it widely applicable for understanding alignments between image-text pairs~\citep{radford2021learning, ramesh2022hierarchical, hou2022feat} as well as between image-image pairs~\citep{yu2024tf}. While previous studies have focused on CLIP's use with image-text and image-image data, we extend its application to image-free contrastive semantic feature extraction, thus broadening the scope in which CLIP can be utilized.

In addition, to address the limitations of the n-gram-based metric ROUGE-L, we incorporate METEOR as an evaluation criterion for the generated responses to better capture semantic similarities. Although both metrics reveal similar trends in the results shown in Table~\ref{tab:meteor}, METEOR accounts for more semantic features than ROUGE-L. In future work, we will expand our experiments to include additional metrics that more effectively account for semantic features, thereby providing a more accurate evaluation of LLM-generated text.



Our research also contributes to the ongoing discourse on the trustworthiness of large language models, offering a pragmatic solution to the challenge of selective answering in question-answering systems. The demonstrated efficacy of our CSS (shown in Figure~\ref{fig:coca_opt}) in identifying and rejecting unreliable generations holds significant implications for the development of more trustworthy LLM applications.


In terms of bias and fairness, vision-language models invariably exhibit varying degrees of bias, as highlighted by \citet{radford2021learning} using the FairFace dataset, which includes race, gender, and age subgroups. However, our current datasets lack subgroup information, and our primary objective is to develop an effective UQ method for LLMs. In future research, we will address bias and fairness issues by incorporating relevant subgroup data into our uncertainty estimation processes, enabling a comprehensive evaluation and improvement of fairness in our models.



\section{Conclusion}

In this paper, we proposed a novel UQ technique for LLMs using Contrastive Semantic Similarity (CSS) to capture insightful semantic relationships between text pairs. By adapting the CLIP text encoder and utilizing spectral clustering, our method estimates the uncertainty of LLM-generated responses more accurately than SOTA techniques.

Our extensive experiments demonstrated that our approach outperforms existing UQ methods, revealing richer semantic information than NLI logits. We also showed the superiority of contrastive feature extraction of CLIP over regular language models, expanding its application scope in language generation. Furthermore, our exploration of the METEOR metric provided a more comprehensive assessment of semantic relationships compared to ROUGE-L, enhancing evaluation criteria for generated texts. Our method also improves selective NLG by more effectively identifying unreliable responses. Future work will focus on further confidence and uncertainty calibration techniques and exploring the application of our method to a broader range of NLG tasks.










% References
\bibliography{bib}


\end{document}
