%\documentclass{uai2025} % for initial submission
\documentclass[accepted]{uai2025} % after acceptance, for a revised version; 
% also before submission to see how the non-anonymous paper would look like 
                        
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2025} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2025} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage[table,dvipsnames]{xcolor}
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams
\usepackage{amssymb}
\usepackage{amsmath}
\usepackage{comment}
\usepackage{arydshln} % for dashed/dotted lines in tables
%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{Over the Top-1: Uncertainty-Aware Cross-Modal Retrieval with CLIP}

% The standard author block has changed for UAI 2025 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author{\href{mailto:<lgomez@cvc.uab.es>}{Lluis Gomez}}
% Add affiliations after the authors
\affil{
    Computer Vision Center\\
    Universitat Autònoma de Barcelona.
}
  
  \begin{document}
\maketitle

\begin{abstract}
State-of-the-art vision-language models, such as CLIP, achieve remarkable performance in cross-modal retrieval tasks, yet estimating their predictive uncertainty remains an open challenge. While recent works have explored probabilistic embeddings to quantify retrieval uncertainty, these approaches often require model retraining or fine-tuning adapters, making them computationally expensive and dataset-dependent. In this work, we propose a training-free framework for uncertainty estimation in cross-modal retrieval. We start with a simple yet effective baseline that uses the cosine similarity between a query and its top-ranked match as an uncertainty measure. Building on this, we introduce a method that estimates uncertainty by analyzing the consistency of the top-1 retrieved item across samples drawn from the posterior predictive distribution using Monte Carlo Dropout (MCD) or Deep Ensembles. Finally, we propose an adversarial perturbation-based method, where the minimal perturbation required to alter the top-1 retrieval serves as a robust indicator of uncertainty. Empirical results in two standard cross-modal retrieval benchmarks demonstrate that our approach achieves superior calibration compared to learned probabilistic methods, all while incurring zero additional training cost.
  
\end{abstract}

\section{Introduction}\label{sec:intro}

Cross-modal retrieval systems enable the retrieval of information across different modalities, such as using a text query to find matching images or vice versa. This capability has gained significant momentum in recent years, driven by the growing need to efficiently search and manage information in increasingly large multimodal databases. The ability to bridge the gap between different modalities has become a cornerstone of modern artificial intelligence applications, ranging from image search engines to multimodal content understanding.

State-of-the-art deep learning models for cross-modal retrieval, such as CLIP (Contrastive Language-Image Pre-training) (\cite{radford2021learning}), ALIGN (\cite{li2021align}), Flamingo (\cite{alayrac2022flamingo}), and BLIP (\cite{li2023blip}), have demonstrated remarkable performance in tasks like image retrieval, text-to-image synthesis, and multimodal reasoning. These models are typically trained by maximizing the agreement between image and text representations, leveraging large-scale datasets to learn rich, semantically aligned embeddings. Their impressive performance on benchmark datasets, such as MSCOCO (\cite{lin2014microsoft}) and Flickr30K (\cite{young2014image}), underscores their effectiveness in bridging visual and textual information.

However, a critical aspect often overlooked in the evaluation of cross-modal retrieval models is the uncertainty associated with their predictions. Standard evaluation metrics, such as Recall@K, focus solely on the accuracy of the retrieval results, providing no insight into how confident the model is in its predictions. In real-world applications, where retrieval errors can have significant consequences -- such as in medical imaging, autonomous systems, or content moderation -- understanding the model’s uncertainty is as important as its accuracy.

Reliable measures of predictive uncertainty are essential for distinguishing between confident, trustworthy predictions and those where the model may be uncertain or even erroneous. Quantifying uncertainty in cross-modal retrieval is particularly challenging due to the complex interactions between modalities, where ambiguity can arise not only from the model but also from the data, the task, and the inherent variability in semantic alignment across different modalities. 

In this work, we focus on \textit{epistemic uncertainty} -- the lack of knowledge about the correct mapping from inputs to outputs. In Bayesian deep learning, epistemic uncertainty is captured by placing a distribution over model weights and observing the spread in predictions this induces. The variance of predictions across posterior samples directly quantifies the model's epistemic uncertainty. Monte Carlo Dropout (\cite{gal2016dropout}) and Deep Ensembles (\cite{lakshminarayanan2017simple}) are practical posterior-sampling techniques that make prediction variability observable, providing a Bayesian estimate of model uncertainty and helping to mitigate the overconfidence of a single deterministic network. Intuitively, if the model's belief about the best prediction is unstable across different sampled weight configurations, it signals a lack of knowledge. In classification tasks, epistemic uncertainty can be measured by the consistency of the model's predictions: high posterior confidence implies the model returns the same top class across nearly all samples, whereas variation in predictions indicates significant epistemic uncertainty.

We argue that the same principles apply to retrieval and ranking tasks, including cross-modal retrieval. A Monte Carlo Dropout (MCD) or Deep Ensemble ranker outputs a predictive relevance distribution rather than a single score, and the dispersion of this predictive distribution corresponds to the model’s uncertainty in the ranking. By leveraging posterior sampling (via dropout or ensembles) and tracking changes in the top retrieval result, we obtain a principled and quantifiable indicator of epistemic uncertainty in retrieval outcomes, consistent with established definitions of model uncertainty in classification and regression tasks.

In this paper, we propose several training-free approaches to quantify uncertainty in cross-modal retrieval, with a focus on pre-trained CLIP models. We explore simple yet effective baselines, such as cosine similarity-based uncertainty scores, as well as more sophisticated methods for predictive uncertainty estimation, including Monte Carlo Dropout (MCD), Deep Ensembles, and adversarial perturbation-based uncertainty estimation. Our evaluation, conducted on standard datasets (MSCOCO and Flickr30K), demonstrates that the proposed uncertainty measures not only correlate well with retrieval performance but also help to identify unreliable rankings, improve retrieval robustness, and enhance the overall trustworthiness of cross-modal retrieval systems.

\section{Related Work}

Uncertainty estimation has been extensively studied in unimodal tasks such as image classification, natural language processing, and time series forecasting. Methods in Bayesian Neural Networks (\cite{blundell2015weight,neal2012bayesian}), including variational inference and Hamiltonian Monte Carlo, provide principled approaches to estimate predictive uncertainty but are often computationally expensive, limiting their scalability to large models. To address these limitations, more scalable techniques have been developed, such as Monte Carlo Dropout (\cite{gal2016dropout}) and Deep Ensembles (\cite{lakshminarayanan2017simple}), which approximate Bayesian inference through stochastic regularization and model diversity, respectively. Additionally, post-hoc calibration methods like temperature scaling (\cite{guo2017calibration}) have been proposed to adjust confidence estimates without modifying the underlying model.

In the multimodal domain, uncertainty estimation remains less explored. Probabilistic embedding methods (\cite{chun2021probabilistic,li2022differentiable,neculai2022probabilistic,ji2023map}) model cross-modal retrieval as a probabilistic matching task, learning uncertainty-aware representations via probabilistic contrastive losses. However, these methods often require retraining models from scratch, which limits their scalability to large pre-trained vision-language models (VLMs).

To reduce computational overhead, adapter-based approaches (\cite{chunimproved, upadhyay2023probvlm}) have been proposed. For example, ProbVLM (\cite{upadhyay2023probvlm}) introduces a probabilistic adapter trained post hoc to estimate uncertainty distributions from frozen VLM embeddings. While ProbVLM achieves strong calibration without modifying the base model, it still relies on additional training and dataset-specific fine-tuning.

In contrast, our work focuses on training-free uncertainty estimation methods for cross-modal retrieval. We systematically investigate approaches such as top-1 similarity-based uncertainty, Monte Carlo Dropout, Deep Ensembles, and adversarial perturbation-based techniques, all of which can be directly applied to pre-trained models like CLIP without additional fine-tuning. Our goal is to provide practical, computationally efficient predictive uncertainty estimates that improve retrieval robustness and help identify unreliable predictions in cross-modal retrieval.



 

\section{Uncertainty Estimation in Cross-Modal Retrieval Models}


In this section, we introduce our framework for estimating uncertainty in cross-modal retrieval models. We first discuss a simple baseline using similarity scores as an uncertainty measure before exploring probabilistic techniques such as Monte Carlo Dropout and Deep Ensembles. Finally, we propose an adversarial perturbation-based approach that quantifies uncertainty based on retrieval robustness.


\subsection{Background}
\label{sec:background}
State-of-the-art cross-modal retrieval models such as CLIP learn embedding functions $\phi_q$ for text queries and $\phi_I$ for images (and vice versa). These functions project their respective inputs into a shared embedding space, aiming to position the representation of a text query $\phi_q(q)$ and an image $\phi_I(I)$ closely together if the image $I$ is relevant to the query $q$. The similarity between embeddings, quantified using a similarity metric $\operatorname{sim}(\phi_q(q), \phi_I(I))$, guides the retrieval of relevant images in response to a given text query. Let $R(q,\mathcal{I})$ denote the retrieval ranking for a query $q$ obtained as:

\begin{equation}
    R(q,\mathcal{I}) = \operatorname{argsort}_{I \in \mathcal{I}} [\operatorname{sim}(\phi_q(q), \phi_I(I))]
\end{equation}

where $\operatorname{argsort}$ sorts the images in the retrieval set $\mathcal{I}$ in descending order of similarity score.

The standard evaluation metric for cross-modal retrieval models is \emph{Recall at Rank K} (R@K), which measures the proportion of queries where a relevant item appears in the top-K results. This metric is particularly preferred in datasets such as MSCOCO and Flickr30k, where each text query has only a single relevant image, and each image query has only five relevant captions. This sparsity in ground-truth relevance makes other retrieval metrics like Mean Average Precision (mAP) less suitable, as they assume multiple relevant items per query. R@K is also more suitable for practical retrieval tasks where users primarily interact with top-ranked results.
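For concreteness, R@K can be written as follows, where $\mathcal{Q}$ (the set of evaluation queries) and $\mathcal{G}(q)$ (the ground-truth relevant items for query $q$) are notation introduced here only for illustration:
\begin{equation*}
    \text{R@}K = \frac{1}{|\mathcal{Q}|} \sum_{q \in \mathcal{Q}} \mathbb{1}\bigl[\exists\, k \le K : R(q,\mathcal{I})_k \in \mathcal{G}(q)\bigr],
\end{equation*}
with $R(q,\mathcal{I})_k$ denoting the item at rank $k$. For $K=1$ this reduces to the fraction of queries whose top-1 retrieved item is relevant.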

In this scenario, a natural approach for estimating retrieval confidence is to use the distance to the top-1 retrieved item as a proxy for the uncertainty of a given query ranking, similar to using the maximum softmax probability as a proxy for the predictive uncertainty of classification models (\cite{guo2017calibration}). In metric spaces, confidence and uncertainty are inherently linked to the density of relevant items. In high-confidence cases, queries should be embedded close to their correct match, yielding high similarity scores, whereas ambiguous queries exhibit lower similarity due to embedding uncertainty. Therefore, we define a simple confidence measure as:

\begin{equation}
    C(q) = \operatorname{sim}(\phi_q(q), \phi_I(I^*)),
\end{equation}

where $I^* = \operatorname{argmax}_{I \in \mathcal{I}} [\operatorname{sim}(\phi_q(q), \phi_I(I))]$ is the top-1 retrieved image for query $q$. In CLIP, the similarity function $\operatorname{sim}$ is cosine similarity, so the confidence score $C(q)$ is bounded in $[-1,1]$; since the similarity between a query and its top-1 match is typically positive, $C(q)$ effectively lies in $[0,1]$. This bounded range makes it an interpretable and normalized proxy for confidence estimation. In the experimental section, we analyze how this simple confidence measure is strongly calibrated with retrieval performance (in terms of R@K), establishing it as an effective baseline for uncertainty estimation ($U(q) = 1-C(q)$) in cross-modal retrieval.


\subsection{Monte Carlo Dropout}
\label{sec:mcd}


Monte Carlo Dropout  (\cite{gal2016dropout}) provides an approximation to Bayesian inference in deep neural networks by enabling dropout (\cite{srivastava2014dropout}) at inference time, effectively sampling from the approximate posterior distribution. More formally, given a neural network with weights $W$, we introduce stochasticity through a dropout mask $z \sim \operatorname{Bernoulli}(p)$ applied independently to each layer during each forward pass:

\begin{equation}
    y^*(x, W, z) = f(x; W, z),
\end{equation}

where $y^*(x, W, z)$ is the output given input $x$. The Bayesian posterior predictive mean is approximated using $M$ stochastic forward passes:

\begin{equation}
    \mathbb{E}_{\hat{p}(y^*|x^*)}[y^*] \approx \frac{1}{M} \sum_{m=1}^{M} y_m^*,
\end{equation}

where $y_m^* = f(x^*; W, z_m)$ is the output from the $m$-th stochastic forward pass. Similarly, the predictive variance is given by:

\begin{equation}
\begin{aligned}
    \operatorname{Var}_{\hat{p}(y^*|x^*)}(y^*) \approx & \tau^{-1} I_D + \frac{1}{M} \sum_{m=1}^{M} y_m^* y_m^{*T} \\
    & - \mathbb{E}_{\hat{p}(y^*|x^*)}[y^*] \mathbb{E}_{\hat{p}(y^*|x^*)}[y^*]^T
\end{aligned}
\end{equation}

where \( \tau^{-1} I_D \) represents the observation noise variance, accounting for aleatoric uncertainty; the second term is the empirical second moment of the outputs across the $M$ stochastic forward passes; and subtracting the final term, the outer product of the predictive mean with itself, centers this moment to yield the sample covariance of the outputs, which captures epistemic uncertainty. This formulation enables uncertainty estimation by leveraging variability across multiple stochastic forward passes.
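The role of the last two terms can be checked directly. Writing $\bar{y}^* = \frac{1}{M} \sum_{m=1}^{M} y_m^*$ for the Monte Carlo estimate of the predictive mean (a shorthand introduced here), expanding the product gives
\begin{equation*}
\begin{aligned}
    & \frac{1}{M} \sum_{m=1}^{M} (y_m^* - \bar{y}^*)(y_m^* - \bar{y}^*)^T \\
    & \quad = \frac{1}{M} \sum_{m=1}^{M} y_m^* y_m^{*T} - \bar{y}^* \bar{y}^{*T},
\end{aligned}
\end{equation*}
so the second and third terms together equal the sample covariance of the $M$ stochastic outputs.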



\cite{gal2016dropout} demonstrated the effectiveness of Monte Carlo Dropout (MCD) for regression and classification tasks, showing that dropout can serve as an efficient Bayesian approximation. Although MCD has been widely applied in unimodal settings such as image classification (\cite{gustafsson2020evaluating}) and natural language processing (\cite{xiao2019quantifying}), its application to cross-modal retrieval remains largely unexplored.

A key challenge in applying MCD to retrieval models is that uncertainty estimation in retrieval is fundamentally different from both classification and regression tasks. In classification, uncertainty is estimated over discrete class probabilities, while in regression, it is captured by the variance of scalar outputs. However, in retrieval models, outputs are rankings derived from distances in a high-dimensional embedding space. In this context, the embedding functions $\phi_q$ and $\phi_I$ can be seen as high-dimensional regressors, mapping input text queries and images (and vice versa) into a shared space where semantic similarity is measured. 

Unlike traditional regression tasks, where uncertainty is directly estimated on a continuous output variable, retrieval uncertainty must be inferred from the variability in ranked similarity scores across stochastic forward passes. Therefore, applying MCD in retrieval requires analyzing the variance of retrieval rankings rather than direct output distributions.

Given a retrieval query $q$ and image gallery $\mathcal{I}$, we obtain $M$ stochastic forward passes of the embedding functions $\phi_q$ and $\phi_I$, resulting in a set of retrieval rankings $\{R^m(q,\mathcal{I})\}_{m=1}^{M}$. From a Bayesian perspective, these retrieval rankings represent samples from the posterior distribution over rankings, induced by the model's uncertainty in embedding representations under dropout. 

To quantify retrieval uncertainty, we propose to measure the consistency of the top-1 retrieval outcome across posterior samples:

\begin{equation}
    U_{\text{MCD}}(q) = 1 - \frac{1}{M} \sum_{m=1}^{M} \mathbb{1}[R^m(q, \mathcal{I})_1 = R^*(q, \mathcal{I})_1],
\end{equation}

where $R^*(q, \mathcal{I})_1$ is the most frequently retrieved top-1 item across all Monte Carlo samples. This formulation reflects epistemic uncertainty, as greater variability in top-1 retrievals suggests higher model uncertainty in ranking stability. 

Intuitively, if the same top-1 item appears consistently across stochastic passes ($U_{\text{MCD}}(q) \approx 0$), the model is confident in its retrieval decision. Conversely, if the retrieved top-1 item varies significantly across posterior samples ($U_{\text{MCD}}(q) \approx 1$), the model exhibits high epistemic uncertainty, signaling potential ambiguity in the ranking.
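As a purely illustrative example (the counts are hypothetical and not taken from our experiments), suppose that with $M=50$ stochastic passes the most frequent top-1 item is retrieved in 41 of them. Then
\begin{equation*}
    U_{\text{MCD}}(q) = 1 - \frac{41}{50} = 0.18.
\end{equation*}
Since the most frequent item is retrieved at least once by construction, $U_{\text{MCD}}(q)$ takes values in $\{0, \tfrac{1}{M}, \dots, 1-\tfrac{1}{M}\}$, and its maximum $1-\tfrac{1}{M}$ is attained only when every pass returns a different top-1 item.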

In Appendix~\ref{sec:hyperparams} we analyze the sensitivity of MCD uncertainty estimation to two key hyperparameters: the dropout rate and the number of samples. Our experiments indicate robustness across typical dropout values, with 0.2 providing the best calibration performance. Moreover, increasing the number of samples improves the stability of the uncertainty estimates, with 20 samples offering a good trade-off between computational efficiency and performance, although our default choice of 50 ensures more robust results.

\subsection{Deep Ensembles}
\label{sec:ensemble}
Deep Ensembles (\cite{lakshminarayanan2017simple}) provide a robust approach for predictive uncertainty estimation by training multiple independent neural networks with different random initializations. Although originally introduced as a non-Bayesian technique, Deep Ensembles have been shown to approximate Bayesian inference (\cite{hoffmann2021deep}), where each model in the ensemble represents a sample from a multimodal posterior distribution over the model parameters.

Formally, consider an ensemble of $K$ independently trained retrieval models $\{\mathcal{M}_k\}_{k=1}^{K}$, each parameterized by weights $\theta_k$. The posterior predictive distribution for a new input $x^*$ is approximated as a uniformly-weighted mixture:

\begin{equation}
    \hat{p}(y^*|x^*) = \frac{1}{K} \sum_{k=1}^{K} p_{\theta_k}(y^*|x^*, \theta_k),
\end{equation}

where $p_{\theta_k}(y^*|x^*, \theta_k)$ is the predictive distribution of the $k$-th model. This formulation aligns with Bayesian model averaging, where the ensemble acts as an approximation to the true posterior by representing it as a mixture of delta functions centered at the maximum a posteriori (MAP) estimates of each model’s parameters.

For regression tasks, this mixture can be approximated by a Gaussian distribution, with the posterior predictive mean and variance given by:

\begin{equation}
    \mathbb{E}_{\hat{p}(y^*|x^*)}[y^*] \approx \frac{1}{K} \sum_{k=1}^{K} \mu_{\theta_k}(x^*)
\end{equation}
\begin{equation}
    \begin{aligned}
    \operatorname{Var}_{\hat{p}(y^*|x^*)}(y^*) &\approx  \frac{1}{K} \sum_{k=1}^{K} \left( \sigma^2_{\theta_k}(x^*) + \mu_{\theta_k}^2(x^*) \right) \\
    & - \left( \mathbb{E}_{\hat{p}(y^*|x^*)}[y^*] \right)^2
    \end{aligned}
\end{equation}

where $\mu_{\theta_k}(x^*)$ and $\sigma^2_{\theta_k}(x^*)$ represent the mean and variance predicted by the $k$-th model.


Applying Deep Ensembles to cross-modal retrieval introduces challenges similar to those encountered with Monte Carlo Dropout. Specifically, uncertainty must be inferred from variability in retrieval rankings rather than scalar outputs or probability distributions over classes. Given a query $q$ and an image gallery $\mathcal{I}$, we obtain $K$ retrieval rankings $\{R^k(q,\mathcal{I})\}_{k=1}^{K}$ from each ensemble member.

To quantify retrieval uncertainty, we propose measuring the consistency of the top-1 retrieval across ensemble members:
\vspace{-1em}
\begin{equation}
    U_{\text{Ens}}(q) = 1 - \frac{1}{K} \sum_{k=1}^{K} \mathbb{1}[R^k(q, \mathcal{I})_1 = R^*(q, \mathcal{I})_1],
\end{equation}

where $R^*(q, \mathcal{I})_1$ is the most frequently retrieved top-1 item across all ensemble models. This metric captures epistemic uncertainty, as greater variability among ensemble predictions indicates less confidence in the retrieval outcome.


\subsection{Adversarial Perturbations for Uncertainty Estimation}

Building on the confidence scores based on top-1 distance (our baseline from Section~\ref{sec:background}) and top-1 consistency (Sections~\ref{sec:mcd} and~\ref{sec:ensemble}), we propose an uncertainty estimation framework based on adversarial perturbations.
The core idea is that robustness to small perturbations in the embedding space can serve as an indicator of model uncertainty: confident rankings should remain stable under minor changes of the query embedding, while uncertain predictions are more susceptible to fluctuations.

Formally, given an input query $q$ and its corresponding embedding $\phi_q(q)$, an adversarial perturbation $\delta$ is defined as the minimal perturbation required to alter the model's output, in our case, the top-1 retrieved item. This can be expressed as the following optimization problem:

\begin{equation}
\label{eq:adv_opt}
    \begin{aligned}
    \delta^* = \operatorname*{arg\,min}_{\delta} \; & \| \delta \|_2 \\
    \text{s.t.} \; & R(\phi_q(q) + \delta, \mathcal{I})_1 \neq R(\phi_q(q), \mathcal{I})_1
    \end{aligned}
\end{equation}
    
This formulation seeks the smallest perturbation $\delta^*$ that changes the top-1 retrieval result. Eq.~\ref{eq:adv_opt} is solved via Projected Gradient Descent (PGD) (\cite{madry2018towards}):
\begin{equation}
    \phi_q(q)^{(t+1)} = \phi_q(q)^{(t)} - \eta \frac{\nabla_q L}{\| \nabla_q L \|_2},
\end{equation}
where $\eta$ is the step size, and $L$ is the difference between the top-1 similarity and the highest-ranked competitor. The final perturbation norm $\| \delta^* \|_2$ serves as a proxy for the model's confidence: 

\begin{equation}
    C_{\text{adv}}(q) = \tanh(\| \delta^* \|_2),
\end{equation}

where $\delta^*$ is the minimal perturbation required to flip the top-1 retrieval, and the $\tanh$ function maps the non-negative perturbation norm to a bounded confidence score in $[0, 1)$. The $\ell_2$ norm offers a practical and interpretable proxy for retrieval robustness, as it directly quantifies how far the query embedding must be displaced to alter the retrieval outcome.
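To give a sense of scale (the norms used here are illustrative values, not measurements), a query whose top-1 result flips after a displacement of $\|\delta^*\|_2 = 0.1$ and one that requires $\|\delta^*\|_2 = 1.0$ receive, respectively,
\begin{equation*}
    \tanh(0.1) \approx 0.10 \quad \text{and} \quad \tanh(1.0) \approx 0.76.
\end{equation*}
Because $\tanh$ is strictly increasing, it preserves the ordering of queries by perturbation norm and only compresses large norms toward 1.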


Notice that we solve the optimization in Eq.~\ref{eq:adv_opt} using PGD in CLIP's embedding space. Specifically, we apply small, normalized gradient steps (with a fixed step size) until the top-1 retrieval result changes or a maximum number of iterations is reached. In this setting, the minimal query embedding perturbation required to alter the top-1 retrieval corresponds to the distance to the nearest decision boundary in embedding space -- that is, the set of points where another candidate becomes more similar than the current top-1 item. Since the cosine similarity function is 1-Lipschitz continuous on the unit sphere, the magnitude of the required perturbation provides a meaningful proxy for retrieval robustness: larger perturbations imply greater distance to the decision boundary, and thus higher model confidence; smaller perturbations indicate proximity to ambiguity regions where the ranking is unstable. This perspective aligns with traditional margin-based uncertainty estimation in classification tasks (e.g., SVMs), where distance to the decision boundary serves as an uncertainty measure. 

In our experiments, we also consider a linear approximation that directly estimates the norm of the minimal perturbation required to flip the ranking:
%linearized estimate based on the gradient of the ranking function:
\begin{equation}
    \| \delta^* \|_2 \approx \frac{\operatorname{sim}(q, I_1) - \operatorname{sim}(q, I_2)}{\| \nabla_q (\operatorname{sim}(q, I_1) - \operatorname{sim}(q, I_2)) \|_2},
\end{equation}

where $I_1$ and $I_2$ are the top-1 and top-2 retrieved items. 
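This estimate follows from a first-order argument, sketched here under the assumption that the similarity margin is approximately linear in a neighborhood of the query embedding. Let $e = \phi_q(q)$ and let $g(e) = \operatorname{sim}(e, I_1) - \operatorname{sim}(e, I_2) \geq 0$ denote the margin (shorthand introduced for this sketch). The ranking flips when the margin vanishes, and a first-order expansion gives
\begin{equation*}
    g(e + \delta) \approx g(e) + \nabla_e g(e)^T \delta = 0.
\end{equation*}
Among all $\delta$ satisfying this linear constraint, the one with the smallest $\ell_2$ norm is parallel to the gradient:
\begin{equation*}
    \delta = - \frac{g(e)}{\| \nabla_e g(e) \|_2^2} \nabla_e g(e), \qquad \| \delta \|_2 = \frac{g(e)}{\| \nabla_e g(e) \|_2}.
\end{equation*}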

In our experiments we apply the adversarial perturbation methods to both text-to-image and image-to-text retrieval tasks. In all cases, the perturbation is applied only to the query embedding (either text or image), while the gallery embeddings remain fixed. 

\section{Experiments}

In this section, we present a comprehensive evaluation of our proposed uncertainty estimation framework for cross-modal retrieval. We begin by describing the datasets and evaluation metrics used in our experiments. This is followed by an in-depth analysis of top-1-based uncertainty estimation techniques, including a comparison of our approach with state-of-the-art probabilistic embeddings to highlight its effectiveness in terms of calibration and efficiency.


\begin{figure*}
    \centering
    \addtolength{\tabcolsep}{-0.9em}
    \begin{tabular}{c c}
        \includegraphics[width=0.5\linewidth]{uai2025-template/figures/calibration_mscoco_v2.pdf} & \includegraphics[width=0.5\linewidth]{uai2025-template/figures/calibration_flickr_v2.pdf} \\
        (a) MSCOCO & (b) Flickr30K \\
        \midrule
    \end{tabular}
    
    \includegraphics[width=\linewidth]{uai2025-template/figures/calibration_legend_v2.pdf}
    \caption{Calibration plots for all considered uncertainty estimation methods on MSCOCO (a) and Flickr30K (b). }
    \label{fig:calibration}
\end{figure*}


\subsection{Datasets and Metrics}

We evaluate our methods on two standard benchmarks for cross-modal retrieval, enabling reproducibility and comparability with prior work: MSCOCO (\cite{lin2014microsoft}) and Flickr30K (\cite{young2014image}).

Flickr30K contains 31,783 images, each paired with five descriptive captions. We follow the standard splits commonly used in cross-modal retrieval benchmarks, such as the CLIP benchmark, with 29,000 images for training, 1,000 for validation, and 1,000 for testing.

MSCOCO-Captions comprises over 123,000 images, each associated with five captions. We adopt the standard 2014 version with 82,783 images for training and 40,504 images for validation/testing. For fair comparison, we follow the established 5K test split protocol, which is widely used in standard benchmarks.

To assess the quality of our uncertainty estimates, we employ a combination of calibration plots, correlation measures, and rejection curves. Calibration Plots (Reliability Diagrams) visualize the relationship between predicted uncertainty scores and actual retrieval performance (measured by Recall@k). Ideally, well-calibrated models should have points lying close to the diagonal, indicating that the predicted confidence aligns with empirical performance.




Following \cite{upadhyay2023probvlm}, we define uncertainty levels by partitioning the dataset based on predicted uncertainty scores. We then compute the Spearman rank correlation (S) to measure the monotonic relationship between uncertainty levels and Recall@1. A perfectly calibrated model would exhibit a correlation of -1, indicating that performance decreases monotonically with increasing uncertainty. 

The R\textsuperscript{2} score, in turn, evaluates how well a linear regression model fits the relationship between uncertainty levels and Recall@1. A higher R\textsuperscript{2} indicates a stronger linear trend. We also report a unified metric (\(-SR^2\)) that combines both scores into a single calibration measure. An ideal model would achieve a score of 1.0, reflecting perfect monotonicity and linearity in the relationship between uncertainty and retrieval performance.
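Here \(-SR^2\) denotes the negated product of $S$ and R\textsuperscript{2}. For instance, the Top1similarity baseline on Flickr30K image-to-text retrieval in Table~\ref{tab:calibration} has $S = -0.98$ and $R^2 = 0.86$, which gives
\begin{equation*}
    -SR^2 = -(-0.98)(0.86) = 0.8428 \approx 0.84.
\end{equation*}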




\subsection{Implementation details}
For all experiments, we use the ViT-L/14 architecture with original pretrained weights from \cite{radford2021learning}. Appendix~\ref{sec_modelsize} provides additional experiments comparing ViT-L/14 and the smaller ViT-B/32 variant.

For the experiments that involve Monte Carlo Dropout (MCD), we perform 50 stochastic forward passes with a dropout rate of $0.2$ during inference to approximate the posterior predictive distribution, which is common practice in the MCD literature to achieve stable uncertainty estimates.




For Deep Ensembles, we construct an ensemble of 12 independently trained ViT-L/14 models sourced from the OpenCLIP repository (\cite{ilharco_gabriel_2021_5143773}). These models are trained on diverse datasets, including OpenAI, LAION, DataComp, MetaCLIP, and DFN. This ensemble configuration enables the capture of diverse model behaviors, contributing to more robust uncertainty estimates through the aggregation of outputs from models with varying inductive biases.


For the adversarial perturbation-based uncertainty estimation, we use a step size of 0.025 and a maximum of 50 iterations. These values were selected empirically on a hold-out set (MSCOCO validation) to balance computational efficiency against the effectiveness of the perturbations in revealing model uncertainty.




\subsection{Uncertainty calibration}

\begin{table*}[]
    \centering
    %\addtolength{\tabcolsep}{-0.2em}
    \begin{tabular}{lcccccccccccc}
    \toprule 
    & \multicolumn{6}{c}{MSCOCO} & \multicolumn{6}{c}{Flickr30K}\\ 
    \cmidrule(lr){2-7}  \cmidrule(lr){8-13}
    & \multicolumn{3}{c}{image2text} & \multicolumn{3}{c}{text2image} & \multicolumn{3}{c}{image2text} & \multicolumn{3}{c}{text2image}\\ 
    \cmidrule(lr){2-4}  \cmidrule(lr){5-7}
    \cmidrule(lr){8-10}  \cmidrule(lr){11-13}
    & S & R\textsuperscript{2} & -SR\textsuperscript{2} & S & R\textsuperscript{2} & -SR\textsuperscript{2} & S & R\textsuperscript{2} & -SR\textsuperscript{2} & S & R\textsuperscript{2} & -SR\textsuperscript{2} \\
    \midrule
    \cite{upadhyay2023probvlm} & -0.99 & 0.93 & 0.93 & -0.30 & 0.35 & 0.10 & \color{gray}{-0.70} & \color{gray}{0.71} & \color{gray}{0.49} & \color{gray}{0.70} & \color{gray}{0.50} & \color{gray}{0.35}\\
    \midrule
    \midrule
    Top1similarity & -1.00 & 0.95 & 0.95 & -1.00 & 0.95 & 0.95 & -0.98 & 0.86 & 0.84 & -1.00 & 0.94 & 0.94\\
    Adversarial & -1.00 & 0.97 & 0.97 & -1.00 & 0.99 & 0.99 & -0.95 & 0.87 & 0.83 & -1.00 & 0.96 & 0.96 \\
    Adversarial lin. & -1.00 & 0.96 & 0.96 & -1.00 & 0.98 & 0.98 & -0.95 & 0.85 & 0.81 & -1.00 & 0.92 & 0.92 \\
    \midrule
Top1similarity (MCD) & -0.92 & 0.88 & 0.82 & -1.00 & 0.96 & 0.96 & -0.97 & 0.85 & 0.83 & -1.00 & 0.96 & 0.96\\
Top1consistency (MCD) & -1.00 & 0.88 & 0.88 & -1.00 & 0.96 & 0.96 & -0.98 & 0.77 & 0.75 & -1.00 & 0.99 & 0.99 \\
Adversarial (MCD) & -1.00 & 0.99 & 0.99 & -1.00 & 0.98 & 0.98 & -0.96 & 0.93 & 0.89 & -1.00 & 0.95 & 0.95 \\
Adversarial lin. (MCD) & -1.00 & 0.97 & 0.97 & -1.00 & 0.95 & 0.95 & -1.00 & 0.95 & 0.95 & -1.00 & 0.91 & 0.91 \\
\midrule
Top1similarity (Ens.) & -0.99 & 0.92 & 0.91 & -1.00 & 0.95 & 0.95 & -0.98 & 0.77 & 0.75 & -0.90 & 0.85 & 0.77\\
Top1consistency (Ens.) & -1.00 & 0.91 & 0.91 & -1.00 & 0.96 & 0.96 & -0.98 & 0.90 & 0.88 & -1.00 & 0.99 & 0.99\\
    Adversarial (Ens.) & -0.97 & 0.89 & 0.86 & -0.99 & 0.90 & 0.89  & -0.90 & 0.83 & 0.75 & -1.00 & 0.84 & 0.84\\

    \bottomrule
    \end{tabular}
    \caption{Uncertainty calibration metrics for all considered methods. The calibration results of ProbVLM \citep{upadhyay2023probvlm} are included for reference, though they are not directly comparable to the other methods (see main text for detailed analysis). Note that the ProbVLM results on Flickr30K are based on models trained on MSCOCO in a cross-dataset scenario.}
    \label{tab:calibration}
\end{table*}

Figure~\ref{fig:calibration} presents calibration plots for all considered methods, while Table~\ref{tab:calibration} provides a quantitative assessment of their calibration in terms of the Spearman rank correlation (S), $R^2$, and $-SR^2$ scores.

To compute the uncertainty levels used in our analysis, we first define bins based on the range of values produced by each uncertainty measure. Specifically, for each method, we identify the minimum and maximum uncertainty scores and divide this range into 10 equally spaced bins, representing different levels of uncertainty. 

Each query is then assigned to one of these bins based on its corresponding uncertainty score. Within each bin, we compute the retrieval performance in terms of Recall@1 (R@1), which reflects the proportion of queries where the correct item is retrieved at the top rank. This binning process allows us to evaluate how well the model’s predicted uncertainty aligns with its actual retrieval accuracy, providing insights into the calibration of the uncertainty estimates.
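
The binning procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than the released code: the function name, the handling of empty bins, and the toy inputs are assumptions.

```python
def recall_at_1_per_bin(uncertainties, correct, num_bins=10):
    """Return a list of (R@1, query count) pairs, one per equal-width bin.

    uncertainties[i] is the uncertainty score of query i, and correct[i] is
    True when the top-1 result of query i is a ground-truth match.
    Empty bins report None as their R@1 (an illustrative choice).
    """
    if len(uncertainties) != len(correct) or not uncertainties:
        raise ValueError("inputs must be non-empty and of equal length")
    lo, hi = min(uncertainties), max(uncertainties)
    width = (hi - lo) / num_bins
    hits = [0] * num_bins
    counts = [0] * num_bins
    for u, c in zip(uncertainties, correct):
        # The maximum score falls into the last bin; constant scores use bin 0.
        b = min(int((u - lo) / width), num_bins - 1) if width > 0 else 0
        hits[b] += bool(c)
        counts[b] += 1
    return [(h / n if n else None, n) for h, n in zip(hits, counts)]


# Toy example with five queries.
scores = [0.0, 0.05, 0.52, 0.95, 1.0]
matches = [True, True, True, False, False]
bins = recall_at_1_per_bin(scores, matches)
```

Because the bins are equally spaced in score rather than equally populated, some bins may contain very few queries, which is why the sketch also returns the per-bin count.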

In a well-calibrated model, we expect a monotonic decrease in R@1 as the uncertainty level increases, indicating that higher uncertainty corresponds to lower retrieval accuracy. This trend is clearly observed in Figure~\ref{fig:calibration}, where R@1 consistently declines across increasing uncertainty levels for most methods, demonstrating effective calibration of the uncertainty estimates.



The results in Table~\ref{tab:calibration} demonstrate that all proposed top1-based methods exhibit strong calibration, as seen in their Spearman rank correlations (S) close to $-1$ and their high $R^2$ and $-SR^2$ scores. These methods directly address uncertainty in retrieval rankings, making them particularly effective for the task at hand.

The \emph{Top1similarity} baseline achieves near-perfect calibration across both image-to-text and text-to-image retrieval tasks. Its simplicity -- using the cosine similarity between the query and the top-1 retrieved item as a confidence score -- proves highly effective, yielding $-SR^2 = 0.95$ for both image-to-text and text-to-image retrieval on MSCOCO.
The method based on Adversarial Perturbations on top of the ranking provided by the deterministic model (``\emph{Adversarial}'' in the table) slightly outperforms this baseline in three of the four settings, the exception being image-to-text retrieval on Flickr30K.
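
For concreteness, the \emph{Top1similarity} confidence can be sketched as follows in plain Python. The embeddings are toy two-dimensional vectors, the function names are hypothetical, and converting the confidence into an uncertainty as one minus the similarity is an assumption made only for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top1_similarity(query, gallery):
    """Return (index of the top-1 item, its cosine similarity to the query)."""
    sims = [cosine(query, item) for item in gallery]
    best = max(range(len(sims)), key=sims.__getitem__)
    return best, sims[best]


# Toy embeddings: one query and a gallery of three candidate items.
query = [1.0, 0.0]
gallery = [[0.0, 1.0], [0.8, 0.6], [-1.0, 0.0]]
index, confidence = top1_similarity(query, gallery)
uncertainty = 1.0 - confidence  # assumed mapping from confidence to uncertainty
```

No extra forward passes are needed for this score, since the similarity is already computed when ranking the gallery.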

The best uncertainty estimation method in terms of average $-SR^2$ is the one based on Adversarial Perturbations on top of the MCD ranking -- ``\emph{Adversarial (MCD)}'' in the table. We observe that using Monte Carlo Dropout (MCD) or Deep Ensembles (Ens.) improves calibration in some tasks and datasets. Although there is no clear winner overall in terms of calibration, the rejection analysis in Section~\ref{sec:rejection} reveals clear differences among methods.



\subsubsection*{Comparison with ProbVLM}

ProbVLM \citep{upadhyay2023probvlm}, while more sophisticated and capable of converting deterministic embeddings into probabilistic ones, demonstrates weaker calibration performance. It is important to highlight that ProbVLM and the other methods considered tackle different problems; thus, their performance metrics are not directly comparable in every aspect.

ProbVLM introduces a probabilistic adapter over pre-trained Vision-Language Models (VLMs) like CLIP, converting their deterministic outputs into probability distributions. However, the calibration results indicate that its uncertainty estimates do not align as closely with retrieval performance as those of the proposed top1-based methods.

The probabilistic approach in ProbVLM is more flexible, enabling the model to capture uncertainties in multi-modal data and supporting advanced downstream tasks like model selection and active learning, which the simpler methods do not support. However, this increased complexity comes at the cost of calibration in retrieval tasks, as shown by its lower $-SR^2$ scores compared to the simpler methods proposed.

Moreover, ProbVLM relies on training data for cross-modal alignment, making it more computationally expensive and data-dependent. As an example, notice the lower results on Flickr30K in Table~\ref{tab:calibration} for ProbVLM trained on MSCOCO -- i.e., in a cross-dataset scenario.
In contrast, the proposed methods show that variability/similarity in top-1 retrieval results provides an excellent indicator of retrieval uncertainty, leading to high-quality uncertainty calibration in a data-agnostic manner.


\subsection{Rejection Plots}
\label{sec:rejection}
We complement the calibration analysis with rejection plots, which show how retrieval performance improves as increasingly uncertain samples are rejected. This helps visualize the utility of uncertainty estimates in practical scenarios, where unreliable predictions may be filtered out to enhance system robustness. Figure~\ref{fig:rejection} presents rejection plots for all considered uncertainty estimation methods.


\begin{figure*}
    \centering
    \addtolength{\tabcolsep}{-0.9em}
    \begin{tabular}{c c}
        \includegraphics[width=0.5\linewidth]{uai2025-template/figures/rejection_mscoco_v2.pdf} & \includegraphics[width=0.5\linewidth]{uai2025-template/figures/rejection_flickr_v2.pdf} \\
        (a) MSCOCO & (b) Flickr30K  
    \end{tabular}
    
    \caption{Rejection plots for all considered uncertainty estimation methods on MSCOCO (a) and Flickr30K (b). The x-axis represents the number of rejected queries, while the y-axis shows Recall@1. The Area Under the Curve (AUC) for each method is indicated in brackets next to the method names in the figure legends, facilitating direct comparison across methods.}
    \label{fig:rejection}
\end{figure*}

To build these plots, for each uncertainty estimation method we first sort all queries in descending order of uncertainty. We then progressively remove the most uncertain queries in batches and recalculate the retrieval performance after each removal. This process allows us to observe how performance evolves as the most uncertain samples are systematically excluded.

For text-to-image (t2i) retrieval, where we have a total of 25,000 and 5,000 queries in MSCOCO and Flickr30K respectively, we remove 500 text queries at each step. In the case of image-to-text (i2t) retrieval, we remove 100 image queries per step due to the smaller query set (5,000 and 1,000 respectively). After each batch removal, we compute Recall@1 (R@1) for the remaining queries to track performance changes as increasingly uncertain samples are filtered out.
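
The rejection procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name and the toy inputs are hypothetical, and ties in uncertainty are broken by the original query order.

```python
def rejection_curve(uncertainties, correct, step):
    """R@1 of the retained queries after rejecting 0, step, 2*step, ... queries.

    uncertainties[i] is the uncertainty score of query i, and correct[i] is
    True when the top-1 result of query i is a ground-truth match.
    Returns a list of (number of rejected queries, R@1 on the remaining ones).
    """
    n = len(correct)
    # Most uncertain first, so a prefix of this ordering is what gets rejected.
    order = sorted(range(n), key=lambda i: uncertainties[i], reverse=True)
    hits = [bool(correct[i]) for i in order]
    kept_hits = sum(hits)
    curve = []
    removed = 0
    while removed < n:
        curve.append((removed, kept_hits / (n - removed)))
        kept_hits -= sum(hits[removed:removed + step])
        removed += step
    return curve


# Toy example: six queries, rejecting two per step.
scores = [0.9, 0.1, 0.8, 0.2, 0.7, 0.3]
matches = [False, True, False, True, True, True]
curve = rejection_curve(scores, matches, step=2)
```

With the step sizes stated above, the same function would be called with \texttt{step=500} for text queries and \texttt{step=100} for image queries.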

A well-calibrated uncertainty estimation method should show a monotonic improvement in R@1 as the most uncertain queries are removed. This is because the retained queries are those for which the model is more confident, leading to higher retrieval accuracy. The upper-bound curve in the plots represents the theoretical maximum performance achievable if the most challenging queries were perfectly identified and removed.

In addition to visualizing the rejection curves, we quantify performance by computing the area under the curve (AUC) for each method. The AUC is calculated using the trapezoidal rule, which approximates the region under the curve as a series of trapezoids. The area of each trapezoid is computed based on the retrieval performance at consecutive points along the rejection curve. Mathematically, this is expressed as:

\begin{equation}
    \int_{a}^{b} f(x)\,dx \approx (b-a)\cdot\tfrac{1}{2}\bigl(f(a)+f(b)\bigr),
\end{equation}

where $f(x)$ represents the retrieval performance (R@1), and $a$ and $b$ are the boundaries of each interval between consecutive rejection steps. The AUC is the sum of these trapezoid areas over all intervals. We normalize the number of rejected samples such that the maximum possible area under the curve equals 1. This method provides an efficient and accurate approximation of the overall performance across the entire range of rejected samples.
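
A minimal Python sketch of this AUC computation is given below. It assumes equally spaced rejection steps, R@1 values in $[0, 1]$, and a horizontal axis rescaled so that the last evaluated point lies at 1; the exact normalization in the released code may differ, and the function name is hypothetical.

```python
def rejection_auc(recalls):
    """Trapezoidal AUC of a rejection curve sampled at equally spaced steps.

    recalls: R@1 values in [0, 1], one per rejection step, in rejection order.
    The x-axis is rescaled to [0, 1], so a curve that is constantly 1 has area 1.
    """
    if len(recalls) < 2:
        raise ValueError("need at least two points to form a trapezoid")
    width = 1.0 / (len(recalls) - 1)  # normalized width of each interval
    return sum(width * 0.5 * (a + b) for a, b in zip(recalls, recalls[1:]))


# Toy curve with three points: R@1 rises from 0.5 to 1.0.
auc = rejection_auc([0.5, 0.75, 1.0])
```

For the toy curve, the two trapezoids have areas $0.3125$ and $0.4375$, giving an AUC of $0.75$.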



As shown in Figure~\ref{fig:rejection}, most methods exhibit a clear upward trend, confirming that their uncertainty estimates effectively identify low-confidence predictions. The computed AUC values -- shown in brackets after the methods' names in the figure legends -- reflect the overall performance improvement, with higher AUC indicating better utilization of uncertainty estimates. This trend is particularly evident in both MSCOCO and Flickr30K, where performance approaches the upper bound as a large fraction of uncertain queries is rejected, highlighting the effectiveness of the uncertainty estimates in improving retrieval robustness.

Interestingly, unlike in the calibration analysis, where all proposed methods performed similarly well, the rejection plots reveal a clear distinction in performance across the different methods. Specifically, methods based on \emph{Top1consistency} (across Monte Carlo Dropout samples) and adversarial perturbations consistently outperform the top-1 similarity baselines. This indicates that while simple similarity-based measures can provide good overall calibration, more sophisticated approaches like MCD-based consistency and adversarial robustness capture deeper aspects of model uncertainty that translate into better performance when uncertain samples are filtered out.

This divergence in findings between calibration and rejection analyses can be attributed to the different aspects of uncertainty each evaluation metric emphasizes. Calibration analysis primarily assesses how well the model’s predicted uncertainty scores align with actual performance, focusing on the global relationship between uncertainty and accuracy across all samples. In contrast, rejection analysis places greater emphasis on the relative ranking of uncertainty estimates -- it evaluates how effectively the model can prioritize uncertain samples for rejection to maximize performance gains.

While top-1 similarity may provide well-calibrated scores on average, it may lack the fine-grained sensitivity needed to distinguish between subtle differences in uncertainty among hard queries. On the other hand, top-1 consistency (MCD) and adversarial perturbation methods are designed to capture model stability and robustness under perturbations, which are more directly linked to the model’s uncertainty in specific decisions. These methods excel in identifying truly uncertain queries, leading to superior performance in rejection scenarios.



\section{Conclusion}

In this work, we have presented a comprehensive framework for uncertainty estimation in cross-modal retrieval models, exploring different techniques to quantify retrieval confidence. We introduced a range of methods, starting from straightforward top-1 similarity-based measures, progressing through probabilistic approaches like Monte Carlo Dropout (MCD) and Deep Ensembles, and culminating in an adversarial perturbation-based method that assesses uncertainty through retrieval robustness.

Our calibration analysis demonstrated that all proposed methods achieve exceptional calibration performance, with top-1 similarity-based approaches providing strong baseline results. Notably, methods incorporating MCD and adversarial perturbations slightly outperformed the baseline in certain settings, although the differences were not pronounced. This suggests that simple confidence measures, such as cosine similarity to the top-1 retrieved item, can be surprisingly effective for aligning predicted confidence with actual retrieval accuracy.

However, rejection analysis uncovered clear distinctions between the methods. Specifically, techniques based on top-1 consistency across MCD samples and adversarial perturbations consistently outperformed top-1 similarity baselines. These methods excelled at identifying truly uncertain queries, leading to superior performance when filtering out unreliable retrieval rankings. This divergence highlights an important insight: while calibration metrics evaluate global alignment between confidence and performance, rejection analysis is more sensitive to a method’s ability to rank uncertainty effectively -- a critical factor in real-world applications where decisions are made based on the most confident predictions.

Our comparison with ProbVLM \citep{upadhyay2023probvlm} reveals that while ProbVLM offers advanced capabilities through probabilistic modeling -- enabling applications like active learning and model selection -- it demonstrated weaker calibration compared to the proposed methods. This performance gap is especially notable in cross-dataset scenarios, due to dataset-specific training dependencies. This highlights an inherent strength of our approach -- dataset agnosticism and superior generalization.

In conclusion, our findings suggest that top1-based, retrieval-focused predictive uncertainty estimation methods, such as MCD-based rank consistency and adversarial perturbation approaches, are not only computationally efficient but also highly effective in both calibration and robustness evaluations. These methods offer strong, data-agnostic performance without the overhead of complex probabilistic modeling, making them well-suited for real-world cross-modal retrieval applications.

While our MCD and Ensemble-based methods do not require additional training, they do incur extra inference-time computation. This overhead scales linearly with the number of MCD samples or the number of models in the Ensemble; however, these computations are trivially parallelizable in practice, leading to minimal time overhead. Moreover, the computational cost can be further mitigated through selective application -- for example, using simpler cosine similarity-based uncertainty (or fewer MCD samples) for routine queries, while reserving more expensive uncertainty estimation for critical or high-risk decisions.

To support reproducibility and further research, the code for all proposed uncertainty estimation methods, along with the evaluation framework used in this work, is made publicly available at \url{http://github.com/lluisgomez/uCLIP}.



\begin{acknowledgements} % will be removed in pdf for initial submission,
						 % (without ‘accepted’ option in \documentclass)
                         % so you can already fill it to test with the
                         % ‘accepted’ class option
    This work is funded by the Ram\'on y Cajal research fellowship RYC2020-030777-I / AEI / 10.13039/501100011033.
\end{acknowledgements}

% References
\bibliography{uai2025-template}


    
\newpage

\onecolumn

\title{Over the Top-1: Uncertainty-Aware Cross-Modal Retrieval with CLIP \\(Supplementary Material)}
\maketitle

\appendix
\section{Sensitivity Analysis of MCD Hyperparameters}
\label{sec:hyperparams}

In this section, we provide additional results analyzing the sensitivity of the Monte Carlo Dropout (MCD) uncertainty estimation to its key hyperparameters: the dropout rate and the number of stochastic forward passes (samples). In the main paper, we use 50 stochastic forward passes with a dropout rate of $0.2$ during inference; here, we evaluate the robustness of our top-1 consistency uncertainty estimates on Flickr30K and MSCOCO retrieval tasks under different settings of these hyperparameters. Calibration plots are shown in Figures~\ref{fig:hyperparam_flickr} and~\ref{fig:hyperparam_mscoco}, respectively.

\begin{figure}[h]
    \centering
    \begin{tabular}{p{\linewidth}}
    \includegraphics[width=0.95\linewidth]{uai2025-template/figures/supplementary/calibration_flickr_top1consistency_param_drop_rate.png}  \\
         (a) Calibration when varying the dropout rate with a fixed number of MCD samples (\texttt{num\_samples}~$= 50$). \\
    \includegraphics[width=0.95\linewidth]{uai2025-template/figures/supplementary/calibration_flickr_top1consistency_param_num_samples.png}  \\
         (b) Calibration when varying the number of MCD samples with a fixed dropout rate (\texttt{drop\_rate}~$= 0.2$). \\
    \end{tabular}
    
    \caption{Calibration plots of MCD top1-consistency uncertainty estimation on Flickr30K text-to-image (t2i) and image-to-text (i2t) retrieval tasks for different hyperparameter settings. Each curve corresponds to a different hyperparameter configuration, with the $-SR^2$ calibration score shown in brackets in the respective legend entry.}
    \label{fig:hyperparam_flickr}
\end{figure}


\begin{figure}[h]
    \centering
    \begin{tabular}{p{\linewidth}}
    \includegraphics[width=0.95\linewidth]{uai2025-template/figures/supplementary/calibration_mscoco_top1consistency_param_drop_rate.png}  \\
         (a) Calibration when varying the dropout rate with a fixed number of MCD samples (\texttt{num\_samples}~$= 50$). \\
    \includegraphics[width=0.95\linewidth]{uai2025-template/figures/supplementary/calibration_mscoco_top1consistency_param_num_samples.png}  \\
         (b) Calibration when varying the number of MCD samples with a fixed dropout rate (\texttt{drop\_rate}~$= 0.2$). \\
    \end{tabular}
    
    \caption{Calibration plots of MCD top1-consistency uncertainty estimation on MSCOCO text-to-image (t2i) and image-to-text (i2t) retrieval tasks for different hyperparameter settings. Each curve corresponds to a different hyperparameter configuration, with the $-SR^2$ calibration score shown in brackets in the respective legend entry.}
    \label{fig:hyperparam_mscoco}
\end{figure}

\noindent
\textbf{Effect of Dropout Rate.} Figures~\ref{fig:hyperparam_flickr}(a) and \ref{fig:hyperparam_mscoco}(a) show calibration plots for varying dropout rates while fixing the number of samples to $50$. We observe that the uncertainty estimates are relatively robust across typical dropout values, with a dropout rate of $0.2$ providing slightly better calibration performance overall.

\noindent
\textbf{Effect of Number of Samples.} Figures~\ref{fig:hyperparam_flickr}(b) and~\ref{fig:hyperparam_mscoco}(b) show calibration plots for varying numbers of samples while fixing the dropout rate to $0.2$. As expected, increasing the number of samples leads to more stable uncertainty estimates. Nevertheless, we find that $20$ samples already provide a good trade-off between performance and computational cost, while our default choice of $50$ samples ensures more stable estimates.



\section{Impact of Model Scale}
\label{sec_modelsize}
To assess the impact of model scale on uncertainty estimation, we conducted additional experiments comparing ViT-L/14 and the smaller ViT-B/32 variant. The results are presented in Table~\ref{tab:calibration2}.


\begin{table}[h]
    \centering
    \addtolength{\tabcolsep}{-0.34em}
    \begin{tabular}{lcccccccccccc}
    \toprule 
    & \multicolumn{6}{c}{MSCOCO} & \multicolumn{6}{c}{Flickr30K}\\ 
    \cmidrule(lr){2-7}  \cmidrule(lr){8-13}
    & \multicolumn{3}{c}{image2text} & \multicolumn{3}{c}{text2image} & \multicolumn{3}{c}{image2text} & \multicolumn{3}{c}{text2image}\\ 
    \cmidrule(lr){2-4}  \cmidrule(lr){5-7}
    \cmidrule(lr){8-10}  \cmidrule(lr){11-13}
    & S & R\textsuperscript{2} & -SR\textsuperscript{2} & S & R\textsuperscript{2} & -SR\textsuperscript{2} & S & R\textsuperscript{2} & -SR\textsuperscript{2} & S & R\textsuperscript{2} & -SR\textsuperscript{2} \\
    \midrule
    %\cite{upadhyay2023probvlm} & -0.99 & 0.93 & 0.93 & -0.30 & 0.35 & 0.10 & {\color{gray}{-0.70}} & {\color{gray}{0.71}} & {\color{gray}{0.49}} & {\color{gray}{0.70}} & {\color{gray}{0.50}} & {\color{gray}{0.35}}\\
    %\midrule
    %\midrule
    Top1similarity (ViT-L/14) & -1.00 & 0.95 & 0.95 & -1.00 & 0.95 & 0.95 & -0.98 & 0.86 & 0.84 & -1.00 & 0.94 & 0.94\\
    Top1similarity (ViT-B/32)  & {\cellcolor{red!0}  -1.00 } & {\cellcolor{blue!0}  0.95 } & {\cellcolor{blue!0}  0.95 } & {\cellcolor{red!0}  -1.00 } & {\cellcolor{blue!0}  0.95 } & {\cellcolor{blue!0}  0.95 } & {\cellcolor{blue!4}  -1.00 } & {\cellcolor{red!6}  0.83 } & {\cellcolor{red!2}  0.83 } & {\cellcolor{red!4}  -0.98 } & {\cellcolor{blue!4}  0.96 } & {\cellcolor{blue!0}  0.94 }\\
\midrule
MCD Top1similarity (ViT-L/14) & -0.92 & 0.88 & 0.82 & -1.00 & 0.96 & 0.96 & -0.97 & 0.85 & 0.83 & -1.00 & 0.96 & 0.96\\
MCD Top1similarity (ViT-B/32)  & {\cellcolor{blue!5}  -0.95 } & {\cellcolor{red!4}  0.86 } & {\cellcolor{blue!0}  0.82 } & {\cellcolor{red!0}  -1.00 } & {\cellcolor{blue!0}  0.96 } & {\cellcolor{blue!0}  0.96 } & {\cellcolor{blue!2}  -0.98 } & {\cellcolor{blue!14}  0.92 } & {\cellcolor{blue!14}  0.90 } & {\cellcolor{red!0}  -1.00 } & {\cellcolor{red!2}  0.95 } & {\cellcolor{red!2}  0.95}\\
\hdashline[0.5pt/1pt]
MCD Top1consistency (ViT-L/14) & -1.00 & 0.88 & 0.88 & -1.00 & 0.96 & 0.96 & -0.98 & 0.77 & 0.75 & -1.00 & 0.99 & 0.99 \\
MCD Top1consistency (ViT-B/32)  & {\cellcolor{red!4}  -0.98 } & {\cellcolor{red!8}  0.84 } & {\cellcolor{red!12}  0.82 } & {\cellcolor{red!0}  -1.00 } & {\cellcolor{blue!6}  0.99 } & {\cellcolor{blue!6}  0.99 } & {\cellcolor{red!4}  -0.96 } & {\cellcolor{red!10}  0.72 } & {\cellcolor{red!10}  0.70 } & {\cellcolor{red!0}  -1.00 } & {\cellcolor{red!2}  0.98 } & {\cellcolor{red!2}  0.98 }\\
\hdashline[0.5pt/1pt]
MCD Adversarial (ViT-L/14) & -1.00 & 0.99 & 0.99 & -1.00 & 0.98 & 0.98 & -0.96 & 0.93 & 0.89 & -1.00 & 0.95 & 0.95 \\
MCD Adversarial (ViT-B/32)  & {\cellcolor{red!2}  -0.99 } & {\cellcolor{red!13}  0.92 } & {\cellcolor{red!15}  0.91 } & {\cellcolor{red!4}  -0.98 } & {\cellcolor{red!21}  0.87 } & {\cellcolor{red!26}  0.85 } & {\cellcolor{blue!6}  -0.99 } & {\cellcolor{red!6}  0.90 } & {\cellcolor{blue!0}  0.89 } & {\cellcolor{red!6}  -0.97 } & {\cellcolor{red!33}  0.78 } & {\cellcolor{red!37}  0.76 }\\
\hdashline[0.5pt/1pt]
MCD Adversarial lin. (ViT-L/14) & -1.00 & 0.97 & 0.97 & -1.00 & 0.95 & 0.95 & -1.00 & 0.95 & 0.95 & -1.00 & 0.91 & 0.91 \\
MCD Adversarial lin. (ViT-B/32)  & {\cellcolor{red!10}  -0.95 } & {\cellcolor{red!15}  0.89 } & {\cellcolor{red!24}  0.85 } & {\cellcolor{red!0}  -1.00 } & {\cellcolor{red!17}  0.86 } & {\cellcolor{red!17}  0.86 } & {\cellcolor{red!2}  -0.99 } & {\cellcolor{red!7}  0.91 } & {\cellcolor{red!9}  0.90 } & {\cellcolor{red!2}  -0.99 } & {\cellcolor{red!26}  0.78 } & {\cellcolor{red!28}  0.77 }\\

    \bottomrule
    \end{tabular}
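    \caption{Uncertainty calibration metrics for the Top1similarity baseline and the MCD-based methods using CLIP ViT-L/14 and ViT-B/32. Shading in the ViT-B/32 rows marks the change relative to the corresponding ViT-L/14 row: blue for better calibration, red for worse, with stronger shading for larger differences.}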
    \label{tab:calibration2}
\end{table}

Our analysis shows that while the larger ViT-L/14 model achieves better calibration metrics overall, the uncertainty estimation performance of ViT-B/32 remains competitive and consistent across tasks. Specifically, for the Top1similarity and Top1consistency methods, the differences between the two models are moderate, suggesting that these uncertainty estimates are robust to model scale. In contrast, for the adversarial methods, the differences between ViT-L/14 and ViT-B/32 are more pronounced, indicating a stronger dependence on model size.


\end{document}
