\documentclass[accepted]{uai2025} % for initial submission
% \documentclass[accepted]{uai2025} % after acceptance, for a revised version; 
% also before submission to see how the non-anonymous paper would look like 
                        
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2025} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2025} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
% \usepackage[american]{babel}
% \usepackage[british]{babel}
\input{math_commands.tex}

% %% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
\usepackage{times}
%     \bibliographystyle{plainnat}
%     \renewcommand{\bibsection}{\subsubsection*{References}}
% \usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
% \usepackage{booktabs} % commands to create good-looking tables
% \usepackage{tikz} % nice language for creating drawings and diagrams

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)
\usepackage[utf8]{inputenc} % allow utf-8 input
\usepackage[T1]{fontenc}    % use 8-bit T1 fonts
\usepackage{hyperref}       % hyperlinks

\usepackage{url}            % simple URL typesetting
\usepackage{booktabs}       % professional-quality tables
\usepackage{amsfonts}       % blackboard math symbols
\usepackage{wrapfig}
\usepackage{nicefrac}       % compact symbols for 1/2, etc.
\usepackage{microtype}      % microtypography
\usepackage{xfrac}
\usepackage{graphicx}
\usepackage{multirow}
\usepackage{subcaption} % For subtable
\usepackage{enumitem}
\usepackage{pifont}
\usepackage{bigdelim,dcolumn,booktabs}
\usepackage{algorithm}
\usepackage{algpseudocode}
\usepackage{amsmath}
\usepackage{array}
\usepackage{tablefootnote}


%% Self-defined macros
\usepackage{xspace}  % GWT: xspace will automatically take care of spacing after your macros
\newcommand{\eacp}{\texttt{E{\small A}CP}\xspace}
\newcommand{\ecp}{\texttt{ECP}\xspace}
\newcommand{\scp}{\texttt{SplitCP}\xspace}
\newcommand{\calX}{\mathcal{X}}
\newcommand{\calY}{\mathcal{Y}}
\newcommand{\rcp}{\texttt{RC}\xspace}
\newcommand{\calD}{\mathcal{D}}
\newcommand{\test}{\mathrm{test}}
\newcommand{\iid}{i.i.d.\@\xspace}
\newcommand{\wrt}{w.r.t.\@\xspace}
\newcommand{\calC}{\mathcal{C}}
\newcommand{\rcc}{\texttt{RC}\xspace}
\usepackage[table]{xcolor}

\hypersetup{
    colorlinks,
    allcolors={blue!50!black}
}

\colorlet{linkequation}{blue!50!black}

\setitemize{topsep=0pt,itemsep=0pt,leftmargin=*}
\setenumerate{topsep=0pt,itemsep=0pt,leftmargin=*}


\newcommand{\cmark}{\text{\ding{51}}}
\newcommand{\xmark}{\text{\ding{55}}}

\definecolor{darkgreen}{HTML}{75C851}

% \newcolumntype{P}[1]{>{\centering\arraybackslash}p{#1}}

\title{Adapting Prediction Sets to Distribution Shifts Without Labels}

% The standard author block has changed for UAI 2025 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
% \href{mailto:<jj@example.edu>?Subject=Your UAI 2025 paper}
\author[1,2]{Kevin Kasa$^{\dagger}$}
\author[3]{Zhiyu Zhang}
\author[3]{Heng Yang}
\author[1,2]{Graham W. Taylor}
% Add affiliations after the authors
\affil[1]{%
    School of Engineering\\
    University of Guelph
}
\affil[2]{%
    Vector Institute for Artificial Intelligence    
}
\affil[3]{%
    Harvard University
}
  
  \begin{document}

% \def\biblio{}
% \date{\vspace{-5ex}}


\maketitle

\begingroup
\renewcommand{\thefootnote}{}
\footnotetext{$^{\dagger}$Work done while visiting Harvard University.}
\endgroup

\begin{abstract}
Recently, there has been a surge of interest in deploying confidence set predictions rather than point predictions in machine learning. Unfortunately, the effectiveness of such prediction sets is frequently impaired by distribution shifts in practice, and the challenge is often compounded by the lack of ground truth labels at test time. Focusing on a standard set-valued prediction framework called conformal prediction (CP), this paper studies how to improve its practical performance using only unlabeled data from the shifted test domain. This is achieved by two new methods called \ecp and \eacp, whose main idea is to adjust the score function in CP according to its base model's own uncertainty evaluation. Through extensive experiments on a number of large-scale datasets and neural network architectures, we show that our methods provide consistent improvement over existing baselines and nearly match the performance of fully supervised methods. Our code can be found at: \href{https://github.com/uoguelph-mlrg/EaCP}{https://github.com/uoguelph-mlrg/EaCP}.

\end{abstract}

\section{Introduction}

Advances in deep learning are fundamentally changing the autonomous decision making pipeline. While most works have focused on accurate point predictions, quantifying the uncertainty of the model is arguably as important. Take autonomous driving as an example: if a detection model predicts the existence of an obstacle, it would be reasonable to take different maneuvering strategies depending on the confidence of the prediction. But is that confidence reliable? In a possible failure mode, the model could report $60\%$ (resp.\ $99\%$) confidence, but the \emph{probability} of an obstacle actually showing up is $99\%$ (resp.\ $60\%$). Such a discrepancy between the model's own uncertainty evaluation and the ground truth probability (or post-hoc frequency) is commonly observed \citep{guo2017calibration,liang2023holistic}, and can compromise safety in downstream decision making. 

% Set-valued classification is a broad framework for returning prediction sets that better equip a machine learning model to deal with this uncertainty and ambiguity \citep{chzhen2021setvalued, denis2017confidence}. 

Set-valued prediction provides an effective way to address this problem \citep{chzhen2021setvalued}, with \emph{conformal prediction} (CP; \citealp{vovk2005algorithmic}) being a well-known special case. Given a fixed black-box machine learning model (called the \emph{base model}) and a covariate $x_{\mathrm{test}}$, the goal  of CP is to generate a prediction set $\calC_{\mathrm{test}}$ that contains (or, \emph{covers}) the unknown ground truth label $y_{\mathrm{test}}$ with a pre-specified probability. Crucially, CP relies on the assumption that the distribution of the data stream is \emph{exchangeable} (a weaker variant of \iid), which allows the fairly straightforward inference of $y_{\mathrm{test}}$ from $x_{\mathrm{test}}$ and the base model's performance on a pre-collected \emph{calibration dataset}. Note that the ground truth label $y_{\mathrm{test}}$ does not need to be revealed after the set prediction is made: exchangeability together with a large enough labeled calibration dataset is sufficient to ensure the desirable coverage probability. This is particularly important for autonomous decision making, where real-time data annotation is expensive or even infeasible. 

However, real-world data streams are usually corrupted by all sorts of \emph{distribution shifts}, violating the exchangeability assumption. Even when the data stream itself is exchangeable, we often want to continually update the base model rather than keeping it fixed, and this can be effectively understood as a distribution shift in the context of CP. In such cases, simply applying exchangeability-based methods could lead to highly inaccurate prediction sets \citep{tibshirani2019conformal,bhatnagar2023improved,kasa2023empirically}. Therefore, making CP compatible with distribution shifts has become a focal point of recent works. 

A number of solutions have been proposed, but the key challenge still remains. For example, \cite{gibbs2021adaptive} formulated the connection between CP and \emph{Online Convex Optimization} (OCO; \citealp{Zinkevichonline}), and the latter is able to handle arbitrarily distribution-shifted environments. The weakness is that ground truth labels are now required at test time (which we call \emph{full supervision}), as opposed to the standard CP procedure. In the other direction, there are CP methods that combat distribution shifts without test time labels \citep{tibshirani2019conformal,barber2023conformal,cauchois2024robust}, but they typically assume the distribution shifts are ``easy'', such that even without labels, we can still rigorously infer the test distribution to a certain extent using the labeled calibration dataset. Overall, it appears that handling both difficulties -- distribution shifts and the lack of test time labels -- is a formidable but important remaining challenge. 


\paragraph{Contributions} Focusing on classification, this paper develops practical unsupervised methods to mitigate the accuracy degradation of CP prediction sets under distribution shifts. The overarching idea is to exploit the uncertainty evaluation of the base model itself. Although such a quantity is not always calibrated in a strict sense, it has been consistently observed to strongly correlate with the magnitude of distribution shifts \citep{hendrycks2017baseline,wang2021tent,kang2024deep}, thus providing a valuable way to probe the test distribution without label access. Building on this high-level idea, we make the following contributions.
\begin{itemize}
\item First, we propose a new CP-inspired method named \ecp~(Entropy scaled Conformal Prediction). The key idea is to scale up the \emph{score function} in standard CP by an ``entropy quantile'' of the base model, calculated on the unlabeled test dataset, which measures the base model's own uncertainty on the test distribution. 

% Such an entropy quantile measures the base model's own uncertainty on the test distribution, and is enforced to be greater than 1.\footnote{This is to ensure that the prediction sets do not become smaller on in-distribution data.} 

More precisely, given each covariate $x_{\mathrm{test}}$ at test time, the score function in standard CP is determined by the fixed base model, and assigns each candidate label a ``propensity score''. Then, the CP prediction set $\calC_{\mathrm{test}}$ simply includes all the candidate labels whose score is above a certain threshold.\footnote{The score functions are assumed to be \emph{positively oriented} \citep{sadinle_least_2019}: labels with larger score are more likely to be included in the prediction set.} By scaling up the score function while keeping the threshold fixed, \ecp~makes the prediction sets larger, which naturally corresponds to the intuition that the uncertainty of prediction should be inflated under distribution shifts. Moreover, the amount of such inflation is strongly correlated with the magnitude of the distribution shift, through the use of the entropy quantile. 

\item Second, we refine \ecp~using techniques from unsupervised \emph{Test Time Adaptation} (TTA) \citep{niu2022efficient}, and the resulting method is named \eacp~(Entropy base-Adapted Conformal Prediction). The key idea is that while \ecp~keeps the base model fixed at test time, we can concurrently update it using \emph{entropy minimization} \citep{grandvalet2004semi,wang2021tent} -- a widely adopted idea in unsupervised TTA -- alongside the aforementioned entropy scaling. This ``adaptively'' reduces the scaling effect that \ecp~applies to the score function, thus shrinking the prediction sets of \ecp. 

\item Finally, we evaluate the proposed methods on a wide range of large-scale datasets under distribution shifts, as well as different neural network architectures. We find that exchangeability-based CP (with and without TTA on the base model) consistently leads to lower-than-specified coverage frequency. However, despite the absence of practical statistical guarantees in this setting, our methods can effectively mitigate this under-coverage issue while keeping the sizes of the prediction sets moderate. Furthermore, our methods also significantly improve the prediction sets generated by the base model itself (without CP). This shows that by bridging the CP procedure (which is statistically sound) and the base model's own uncertainty evaluation (which is often informative), our methods enjoy the practical benefits of both worlds. 
\end{itemize}



% \section{Related works}

% \textbf{CP under distribution shifts}\quad 

\paragraph{Related works} Considerable efforts have been devoted to developing CP methods robust to distribution shifts, which can be approximately categorized into two directions. The first direction does not require test time labels \citep{tibshirani2019conformal,cauchois2024robust}, but the distribution shift is assumed to be simple in some sense. The second direction is connecting CP to adversarial online learning \citep{gibbs2021adaptive}, but the true labels are required at test time. Due to space constraints, a thorough discussion is deferred to Appendix~\ref{appendix:additional_related}, as well as a number of applications that motivate this work. 
% \begin{itemize}
% \item The first direction does not require test time labels, but the distribution shift is assumed to be simple in some sense. For example, \citet{tibshirani2019conformal} studied CP under \emph{covariate shifts}, where the distribution of the label $y$ conditioned on the covariate $x$ remains unchanged. Here, it suffices to use the classical \emph{likelihood ratio reweighting} on the calibration dataset, but accurately estimating the likelihood ratio can be challenging in practice. Another idea is to take a robust optimization perspective by assuming a certain maximum level of distribution shift and protecting against the worst case, e.g., \citep[Chapter~8]{roth2022uncertain} and \citep{cauchois2024robust}. The weakness here is the sensitivity to the hyperparameter, and the obtained prediction sets could be overly conservative. 

% Various works built on these two ideas. \citet{barber2023conformal} generalized the reweighting idea to handle mild but general distribution shifts, but choosing the weights is generally unclear in practice. \citet{ai2024not} tackled general distribution shifts by combining reweighting and robust optimization, which also combines the strengths and limitations from the two sides. \citet{ge2024optimal} extended the two ideas to the aggregation of multiple CP algorithms. 

% \item The second direction is connecting CP to adversarial online learning. A line of works \citep{gibbs2021adaptive,angelopoulos2023conformal,gibbs2024conformal, saocp, zhang2024discounted} applied regret minimization algorithms in OCO to select the score threshold in CP, and \cite{bastani2022practical} achieved this task using \emph{multicalibration}. By relaxing the CP objective from the \emph{coverage probability} to the \emph{post-hoc coverage frequency}, these methods can handle arbitrary continual distribution shifts. However, they require the true label to be provided after every prediction, which is a limiting requirement for many use cases in autonomous decision making. Our experiments will show that it is possible to achieve comparable performance in these settings without this limitation, i.e., being ``label free''.
% \end{itemize}

% \textbf{Unsupervised \emph{Test Time Adaptation} (TTA)}\quad 

Our techniques are inspired by core ideas in (unsupervised) TTA, whose goal is to update a trained machine learning model at test time, using unlabeled data from shifted distributions. To achieve this, one could update the batch-norm statistics on the test data \citep{nado2020evaluating, schneider2020improving, khurana2021sita}, or minimize the test-time \emph{prediction entropy} -- a natural measure of the model's uncertainty \citep{wang2021tent, zhang2022memo, niu2022efficient, song2023ecotta,press2024entropy}. Notably, these methods can be applied to any probabilistic and differentiable model (such as modern neural networks), which is naturally congruent with the key strength of CP. However, to date this line of work has not been connected to the conformal prediction literature.

% , and to our knowledge there are no studies investigating the benefit of TTA to the distribution-shifted CP setting. 

\section{Preliminaries of CP}\label{section:preliminary}

We begin by introducing the standard background of CP without distribution shifts. For clarity, we assume \iid data in our exposition, rather than the slightly weaker notion of exchangeability. Also see \citep{roth2022uncertain,angelopoulos2022gentle,cplecture}.

Let $\calD$ be an unknown distribution on the space $\calX\times\calY$ of covariate-label pairs, and let $\alpha\in(0,1)$ be the \emph{error rate} we aim for. Given a calibration dataset $D$ consisting of $n$ \iid samples $\{x^*_i,y^*_i\}_{i\in[n]}\sim\calD^n$, the goal of CP is to generate a set-valued function $\calC:\calX\rightarrow 2^\calY$, such that
\begin{equation}
\mathbb{P}_{(x_\test,y_\test)\sim\calD,D\sim\calD^n}\left[y_\test\in \calC(x_\test)\right] \geq 1 -\alpha.\label{guarantee}
\end{equation}
That is, for a fresh test sample $(x_\test,y_\test)\sim\calD$, our prediction set $\calC(x_\test)$ covers the ground truth label $y_\test$ with guaranteed high probability. Notice that Eq.(\ref{guarantee}) alone is a trivial objective, since it suffices to predict the entire label space $\calC(x)=\calY$ for all $x$. Therefore, CP is essentially a bi-objective problem: as long as Eq.(\ref{guarantee}) is satisfied, we want the prediction set $\calC(x)$ to be small. 

The main difficulty of this set-valued prediction problem is that the range of output $2^\calY$ is too large. In this regard, the key idea of CP is reducing the problem to 1D prediction via a trained machine learning model (called the \emph{base model}), such as a neural network. Specifically, we assume access to a (positively oriented; i.e., larger is better) \emph{score function} $s:\calX\times\calY\rightarrow\mathbb{R}_+$ given by the base model, such that for each test covariate $x_\test\in\calX$ and \emph{candidate label} $y\in\calY$, $s(x_\test,y)$ measures how likely the model believes that $y$ is the true label $y_\test$. Then, all that is left for CP is to pick a threshold $\tau_D\in\mathbb{R}$ that depends on the dataset $D$, and predict the label set (if the score function is negatively oriented, then $\geq$ is replaced by $\leq$)
\begin{equation}\label{thr_set}
\calC(x_\test) := \left\{y\in\calY: s(x_\test,y) \geq \tau_D\right\}.
\end{equation}

Under the \iid assumption, the coverage objective Eq.(\ref{guarantee}) is satisfied by picking $\tau_D$ as the $\alpha(1-n^{-1})$-quantile of the \emph{empirical scores} $\{s(x^*_i,y^*_i)\}_{i\in[n]}$. Since the training data of the base model is split from the calibration dataset used to determine $\tau_D$, this approach is commonly known as \emph{split conformal prediction}, which we refer to as \scp. Notably, $\tau_D$ is determined by the calibration dataset $D$; once the latter is fixed, there is no need to access the ground truth labels at test time. 
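
As a rough numerical illustration (the numbers here are hypothetical and are not taken from our experiments), suppose $n=1000$ and $\alpha=0.1$. The quantile level is then
\begin{equation*}
\alpha(1-n^{-1}) = 0.1\times 0.999 = 0.0999,
\end{equation*}
so $\tau_D$ sits at roughly the $10$th percentile of the true-label scores on the calibration dataset. In other words, about $90\%$ of the calibration samples have a true-label score of at least $\tau_D$, and these are the samples whose true label would be included by the thresholding rule Eq.(\ref{thr_set}).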

\paragraph{Examples in classification} This paper focuses on classification. In this case, a simple and popular choice of the score function is $s(x,y) = \pi_\theta(x)_{y}$ \citep{sadinle_least_2019}, where $\pi_\theta$ is a trained neural network parameterized by $\theta$, and $\pi_\theta(x)_{y}\in[0,1]$ is the softmax score corresponding to one of the $k$ classes $y\in[k]$. Such a score function is positively oriented, which we adopt in this work. Another well-known choice due to \cite{romano2020classification} is negatively oriented, and our methods can be applied there as well.
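
To make Eq.(\ref{thr_set}) concrete with the softmax score, consider a purely hypothetical example with $k=4$ classes, a threshold $\tau_D=0.12$, and the softmax output
\begin{equation*}
\pi_\theta(x) = (0.60,\ 0.25,\ 0.10,\ 0.05).
\end{equation*}
Only the first two entries are at least $\tau_D$, so $\calC(x)=\{1,2\}$. For a more confident hypothetical output such as $(0.90,\ 0.05,\ 0.03,\ 0.02)$, the same threshold yields the singleton $\calC(x)=\{1\}$. The set size thus varies with how concentrated the softmax output is, while the threshold stays fixed.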

\paragraph{Distribution shift} For the rest of this paper, we study the following variation of the above standard CP problem. At test time, instead of working with a single test sample $(x_\test,y_\test)$ drawn from $\calD$, we consider a size-$N$ collection of samples\footnote{The clearest notation is to index the test samples by $(x_{\test,i},y_{\test,i})$. Here we omit the subscript ``$\test$'' for conciseness.} $\{x_i,y_i\}_{i\in[N]}$ drawn from some new unknown distribution $\calD_\test$. We only observe the covariates, defined as the \emph{test dataset} $D_\test=\{x_i\}_{i\in[N]}$. Importantly, the ground truth labels on $D_\test$ are not revealed even after predictions are made. The goal, from a practical perspective, is to output a small prediction set $\calC(x_i)$ at each test covariate $x_i$, satisfying the specified \emph{empirical coverage rate}, 
\begin{equation*}
\frac{1}{N}\sum_{i=1}^N\bm{1}[y_i\in\calC(x_i)]\geq 1-\alpha.
\end{equation*}
The function $\calC$ can now depend on both the labeled calibration dataset $D$ and the unlabeled test dataset $D_\test$.

% where the test sample $(x,y)$ is \emph{not} drawn from $\calD$, but rather some new unknown distribution $\widetilde{\calD}_\test$. 

In general, it is impossible to prove meaningful bounds without assuming some form of similarity between $\calD$ and $\calD_\test$, but we will show that with help from the base model, the CP procedure can be modified to work well in practice.

\section{Our methods}

In this section, we first propose a method called \ecp~(Entropy scaled Conformal Prediction), which improves the coverage rate of CP by enlarging its prediction sets using the uncertainty evaluation of the base model itself. Crucially, this notion of uncertainty can be directly minimized and refined through unsupervised TTA, leading to an improved method called \eacp~(Entropy base-Adapted Conformal Prediction). The latter is able to both recover the desired error rate on many challenging distribution-shifted datasets, and significantly reduce inflated set sizes under increased uncertainty.


% \begin{itemize}
% \item recover the desired error rate on many challenging distribution-shifted datasets; and
% \item significantly reduce inflated set sizes under increased uncertainty.   
% \end{itemize}

\subsection{Scaling conformal scores by uncertainty}\label{scn:scaling-uncert}

% A common criticism of the original \scp (although typically phrased in the regression setting rather than classification) is that the size of the CP prediction set $\calC(x)$ is invariant to the covariate $x$. 

% (i.e., the size of $\calC(x)$ scales with the uncertainty of the base model at $x\in\calX$)

Let us start with a high-level motivation. Within the \scp framework, an important design objective is \emph{local adaptivity}: the size of the prediction set $\calC(x)$ needs to vary appropriately with the covariate $x$. To this end, standard practice is to adjust the score function $s(x,y)$ based on some notion of uncertainty (or difficulty) that the base model decides at each $x$ \citep{PapadopoulosCpKNN, Johanssonbagged, lei2017distributionfree,izbicki2019flexible, romano2019conformalized,seedat2023improving, rossellini24aUncertCQR}. This has the effect of inflating the prediction set on the base model's uncertain regions, and has been shown to improve the more informative \emph{conditional coverage rate} of CP \citep{angelopoulos2022gentle, cplecture}. 

\begin{figure}[h]
% \vspace{-10pt}
    \centering
        \includegraphics[width=0.4\textwidth]{images/ent_vs_smx.pdf}
        \caption{Entropy vs.~the softmax score of the true label, averaged on each dataset. Different colors represent different datasets, and darker shades represent greater severity levels of ImageNet-C corruptions. See Section \ref{sec:datasets} for more details on the datasets.}
        \label{fig:ent_vs_smx}
\end{figure}

\paragraph{Key idea} Inspired by these results, our key idea is to apply an analogous uncertainty scaling on the score function, to improve the performance of CP under distribution shifts. However, instead of using the uncertainty of the base model at each covariate $x$, we draw a crucial connection to unsupervised TTA, and evaluate the base model's uncertainty on the whole distribution-shifted \emph{test dataset} $D_\test$ -- this effectively aggregates its ``localized'' uncertainty at the test covariates $\{x_i\}_{i\in[N]}$. In other words, instead of aiming for ``local adaptivity'' as in prior works, we use uncertainty scaling to achieve the adaptivity \wrt the unknown distribution shift. 

\paragraph{Prediction entropy} More concretely, which uncertainty measure should we use on the base model? As discussed above, the ideal dataset-specific uncertainty measure would follow from a ``localized'' uncertainty measure at each covariate $x$, and in the context of classification, a particularly useful one is the \emph{entropy} of the base model's probabilistic prediction,
\begin{equation*}
h(x) = -\sum_{y\in[k]} \pi_\theta(x)_{y} \log\pi_\theta(x)_{y}.
\end{equation*}
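
For intuition, and assuming the natural logarithm (the base is an assumption made only for this illustration), $h(x)$ ranges from $0$ for a one-hot prediction to $\log k$ for the uniform prediction. With a hypothetical $k=4$ softmax output $(0.60,\ 0.25,\ 0.10,\ 0.05)$, we get
\begin{equation*}
h(x) \approx 0.306 + 0.347 + 0.230 + 0.150 = 1.033,
\end{equation*}
compared with the maximum possible value $\log 4\approx 1.386$.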

Previous works have established the relation between such an entropy notion and the magnitude of the distribution shift, showing that larger shifts are strongly correlated with higher entropy (thus higher uncertainty in the base model) \citep{wang2021tent,kang2024deep}. We provide a consistent but unique observation in Figure~\ref{fig:ent_vs_smx}, which plots the relation between the entropy (averaged over all $x$ values in the dataset) and the \emph{softmax score of the true label} (also averaged over $x$), evaluated on a ResNet-50 model\footnote{We fix the base model to ResNet-50 in most of our experiments, unless otherwise specified.} and across a range of datasets. For the true label to be included in the CP prediction set, which is eventually what we aim for, its softmax score should be greater than the CP threshold $\tau_D$. Figure~\ref{fig:ent_vs_smx} shows that an increase in entropy is associated with a decrease in the softmax score of the true label, which crucially means that we need to scale up the score function in order to still cover the true label. 

% the prediction sets should be larger when the entropy is higher. 

\begin{figure*}[ht]
    \centering
    \includegraphics[width=0.9\textwidth]{images/adj_vs_cov.pdf}
        \caption{The targeted coverage rate $1-\alpha$ vs.~the empirical coverage rate, induced by \ecp~with different $\beta$ values (represented by different colors). It shows that simply setting $\beta=1-\alpha$ in \ecp~(i.e., the blue dots) consistently works well for all but the most severe distribution shifts (e.g., ImageNet-R and ImageNet-C Contrast Severity 5). Such an observation also holds across various $\alpha$ values, suggesting the effectiveness of this hyperparameter choice.  \label{fig:adj_vs_cov}}
\end{figure*}

Now consider going from the ``localized'' uncertainty measure $h(x)$ to an uncertainty measure on the test dataset $D_\test$, denoted as $u_{\mathrm{test}}$. One could use the average $N^{-1}\sum_{i=1}^Nh(x_i)$, but to increase the robustness, we define $u_{\mathrm{test}}$ as the $\beta$-quantile of $\{h(x_i)\}_{i\in[N]}$, where $\beta$ is a hyperparameter. Quite surprisingly, we find that simply setting $\beta$ to the desired coverage rate $1-\alpha$ is a fairly reliable choice in practice (see Figure~\ref{fig:adj_vs_cov}), which gives a robust (over)-estimate of typical $h(x)$ values on the test dataset. We perform all the experiments with this direct relationship to avoid excessive hyperparameter tuning, but it can be further refined if desired. 
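
As a small hypothetical example of the difference between the two aggregates, take $N=10$ test covariates with entropies
\begin{equation*}
\{h(x_i)\}_{i\in[10]} = \{0.2,\ 0.3,\ 0.3,\ 0.4,\ 0.5,\ 0.6,\ 0.8,\ 1.1,\ 1.6,\ 2.4\}.
\end{equation*}
Their average is $8.2/10=0.82$. With $\beta=0.9$, and taking the $\beta$-quantile to be the smallest value that at least $90\%$ of the entries do not exceed (other quantile conventions give slightly different values), we obtain $u_{\mathrm{test}}=1.6$. The quantile is therefore roughly twice the average here: it is governed by the more uncertain test covariates rather than by the typical one.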

\paragraph{Method: \ecp} Now we are ready to use $u_{\mathrm{test}}$ above to scale the score functions on the test dataset, without label access. The resulting method is named \ecp~(Entropy scaled Conformal Prediction). 

Formally, define $q_\beta(\cdot)$ as the $\beta$-quantile of its argument, and let the base model's uncertainty measure $u_\test$ be the ``entropy quantile''
\begin{equation}\label{eqn:entropy_quantile}
    u_{\mathrm{test}} =q_{1-\alpha}( \{h(x_i)\}_{i\in[N]}).
\end{equation}
On any test covariate $x_i$, modified from Eq.(\ref{thr_set}), we scale the score function by $\max(1,u_{\mathrm{test}})$ to form the prediction set
\begin{equation}
    \calC(x_i) := \{y\in [k]: s(x_i,y) \cdot \max(1,u_{\mathrm{test}}) \ge \tau_D\}.
    \label{eqn:ecp}
\end{equation}
Here, we take a maximum with $1$ to ensure that the prediction sets of \ecp~cannot be smaller than those of standard \scp. The pseudocode is presented as Algorithm~\ref{alg:eacp} in the next subsection.
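
Since $\max(1,u_{\mathrm{test}})>0$, the rule Eq.(\ref{eqn:ecp}) can equivalently be read as keeping the score function fixed and lowering the threshold:
\begin{equation*}
s(x_i,y) \cdot \max(1,u_{\mathrm{test}}) \ge \tau_D
\quad\Longleftrightarrow\quad
s(x_i,y) \ge \frac{\tau_D}{\max(1,u_{\mathrm{test}})}.
\end{equation*}
As a hypothetical example, let $k=4$, $\tau_D=0.12$, $u_{\mathrm{test}}=1.5$, and let the softmax scores at $x_i$ be $(0.60,\ 0.25,\ 0.10,\ 0.05)$. The scaled scores are $(0.90,\ 0.375,\ 0.15,\ 0.075)$, so $\calC(x_i)=\{1,2,3\}$, whereas the unscaled rule Eq.(\ref{thr_set}) gives $\{1,2\}$; the effective threshold has dropped from $0.12$ to $0.12/1.5=0.08$. If instead $u_{\mathrm{test}}=0.7$, the scaling factor is $\max(1,0.7)=1$ and the prediction set is unchanged.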


To recap, the intuition of \ecp~is that a larger distribution shift will result in larger entropy predicted by the base model, which then leads to a correspondingly larger up-scaling of the score function. In this way, more candidate labels have scores larger than the fixed CP threshold $\tau_D$, and the prediction set grows. Without any access to the test labels, this can help mitigate the under-coverage issue of standard \scp~under distribution shifts, and further details are provided in our experiments (Section~\ref{section:experiment}).


\subsection{Optimizing uncertainty using TTA}

While \ecp\ already improves the coverage rate of \scp~on several datasets, it inevitably leads to larger set sizes and, like typical post-hoc CP methods, still relies on a fixed base model. To remedy this, we refine \ecp~using \emph{entropy minimization} \citep{grandvalet2004semi,wang2021tent}, a classical idea in unsupervised TTA which updates the base model itself on the unlabeled test dataset. Although such techniques in unsupervised TTA have been investigated in the context of top-1 accuracy, we take a different perspective and study their ability to improve set-valued classifiers like conformal predictors.

\paragraph{Key idea} Concretely, we first rewrite the entropy $h(x)$ as a loss function \wrt the base model's parameter $\theta$,
\begin{equation}
    \mathcal{L}(x;\theta):=h(x) = -\sum_{y\in[k]} \pi_\theta(x)_y\log\pi_\theta(x)_y.
    \label{eqn:ent_loss}
\end{equation}
Our main idea is to update the base model by minimizing this loss function (or a suitable variant) on the test dataset $D_\test$, before applying \ecp. This brings two benefits. 
\begin{itemize}
\item The updated base model is better suited for the shifted distribution $\calD_\test$, which generally improves the quality of the prediction sets built on top of it. 
\item The base model's entropy determines the amount of prediction set inflation due to \ecp. By directly minimizing the entropy, the resulting prediction sets can be smaller. 
\end{itemize}
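
To see why minimizing Eq.(\ref{eqn:ent_loss}) sharpens the prediction, it helps to write the gradient with respect to the logits. As a sketch, assume $\pi_\theta(x)=\mathrm{softmax}(z)$ for a logit vector $z\in\mathbb{R}^k$ (the symbol $z$ and the shorthand $\pi_j=\pi_\theta(x)_j$ are introduced only for this illustration), and take the natural logarithm. Using $\partial \pi_y/\partial z_j = \pi_y(\bm{1}[y=j]-\pi_j)$, we get
\begin{equation*}
\frac{\partial h(x)}{\partial z_j}
= -\sum_{y\in[k]} (\log\pi_y + 1)\,\pi_y\,(\bm{1}[y=j]-\pi_j)
= -\pi_j\left(\log\pi_j + h(x)\right).
\end{equation*}
A gradient descent step on the logits therefore raises $z_j$ exactly when $\pi_j > e^{-h(x)}$ and lowers it otherwise. Classes that are already relatively likely gain probability mass, which concentrates the softmax output and reduces the entropy.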

% that by directly minimizing this loss (through updating $\theta$) on the test dataset $D_\test$, we implicitly minimize the entropy quantile $u_{\mathrm{test}}$ used to scale the score functions. Consequently, the sizes of \ecp~prediction sets can be reduced, and the coverage rate can be increased.

\paragraph{Method: \eacp} A number of specific TTA methods have been developed to minimize entropy, while ensuring certain notions of stability. In this work, we leverage a recent method called \texttt{ETA} (Efficient Test-time Adaptation; \citealt{niu2022efficient}), due to its simplicity and effectiveness even under continual distribution shifts \citep{press2024rdumb}. Combining this with \ecp~results in a new CP method, which we call \eacp~(Entropy base-Adapted Conformal Prediction). 

In practice, one could simply call \texttt{ETA} as a subroutine, so here we only present its high-level idea for completeness. First, the test dataset $D_\test$ is divided into a collection of batches. On each batch (i.e., $x_i$ with a collection of indices $i$), \texttt{ETA} filters the base model's outputs (i.e., softmax scores) $s(x_i,\cdot)$ by excluding the outputs similar to those already seen. Then, it reweighs the remaining indices based on the associated entropy $h(x_i)$, with lower entropy (less uncertain) indices receiving higher weights. This leads to a weighted batch variant of the loss function Eq.(\ref{eqn:ent_loss}), which is then minimized by performing a single gradient update. Subsequently, the updated base model is applied to \ecp~to form the prediction sets of \eacp, according to Eq.(\ref{eqn:ecp}). 
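
Schematically, one such update can be written as follows. The notation is introduced only for this illustration: $B$ is the index set of a batch, $w_i\geq 0$ are per-sample weights, and $\eta>0$ is a step size. The exact filtering rule and weight function are those of \citet{niu2022efficient} and are not reproduced here. Up to normalization,
\begin{equation*}
\mathcal{L}_B(\theta) = \sum_{i\in B} w_i\,\mathcal{L}(x_i;\theta),
\qquad
\theta \gets \theta - \eta\,\nabla_\theta \mathcal{L}_B(\theta),
\end{equation*}
where $w_i=0$ for the filtered-out indices, and $w_i$ is larger for indices with lower entropy $h(x_i)$.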

The combined pseudocode of \ecp~and \eacp~is provided in Algorithm~\ref{alg:eacp}. Here we include an \emph{uncertainty scaling function} $f$ as a small generalization, which acts on the entropy quantile before generating the prediction sets. So far we have only considered the trivial scaling $f(x)=x$, but more choices will be studied in the next subsection. 

%%%%%%%% Algorithm
\begin{algorithm}[t]
\caption{Combined pseudocode of \ecp~and \eacp}\label{alg:eacp}
\begin{algorithmic}

\Require Test dataset $D_\test=\{x_i\}_{i\in[N]}$; trained model with parameter $\theta$ and softmax score $\pi_\theta(x)_y$; targeted error rate $\alpha$; score threshold $\tau_D$ for the error rate $\alpha$, calculated on a calibration dataset $D$; uncertainty scaling function $f:\mathbb{R}_+\rightarrow\mathbb{R}_+$.
% Test batch $(x_i)_n\sim\calX^n$, coverage level $\alpha$, scaling function $f$, score function $s$, calibration threshold $\tau_D$, trained model $\pi_\theta$
\If{\eacp}
    \State $\theta \gets \texttt{ETA}(\theta,D_\test)$ \Comment{TTA sub-routine}
    
    % \State $\theta' \gets \texttt{ETA}(\pi_\theta((x_i)_n))$ \Comment{Do test-time adaptation sub-routine}
\EndIf

\State $u_\test \gets q_{1-\alpha}( \{h(x_i)\}_{i\in[N]})$ \Comment{Eq.(\ref{eqn:entropy_quantile})}

\State $u_\test \gets f(u_\test)$ \Comment{Uncertainty scaling}

% \Comment{Modify the entropy adjustment factor}

\For{$x_i\in D_\test$}
\State \textbf{return} $\calC(x_i)  := \{y\in [k]: s(x_i,y) \cdot \max(1,u_{\mathrm{test}}) \ge \tau_D\}$ \Comment{Predict the label set, Eq.(\ref{eqn:ecp})}

\EndFor

\end{algorithmic}
\end{algorithm}
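To make the thresholding rule in Algorithm~\ref{alg:eacp} concrete, consider a small worked example. The numbers below are hypothetical and chosen only for illustration; they are not taken from our experiments. Suppose $\tau_D=0.10$ and the scaled entropy quantile is $f(u_\test)=2.5$. Since $\max(1,2.5)=2.5$, a label $y$ enters the prediction set if and only if
\begin{equation*}
s(x_i,y)\cdot 2.5 \ge 0.10 \quad\Longleftrightarrow\quad s(x_i,y)\ge \frac{0.10}{2.5}=0.04.
\end{equation*}
In this example, the entropy quantile lowers the effective threshold on the softmax score from $0.10$ to $0.04$ and thus enlarges the sets. Conversely, any value $f(u_\test)\le 1$ leaves the threshold at $\tau_D$ because of the clipping $\max(1,\cdot)$.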
%%%%%%%% \endAlgorithm


In Section~\ref{section:experiment}, we demonstrate that \eacp~can further improve the empirical performance of \ecp, by increasing the coverage rate while maintaining informative set sizes. 

\subsection{Uncertainty scaling function}\label{scn:uncert-scaling}

\begin{figure*}[!ht]
    \centering
    \includegraphics[width=0.9\textwidth]{images/ent_tauhat_prop-1row.pdf}
        \caption{$u_{\mathrm{test}}$ versus $\tau_D/\tau_{\mathrm{test}}$ on a log-log scale. On mild and moderate distribution shifts, the linear fit on the log-log plot has slope between 1 and 2. This suggests the effectiveness of using a linear or quadratic function as $f(\cdot)$, which acts on the entropy quantile. However, we also observe that a higher-order polynomial is required on more difficult shifts, such as ImageNet-R. \label{fig:ent_tau_relation}}
\end{figure*}

In Eq.(\ref{eqn:ecp}), we essentially scale the score functions linearly by the entropy quantile $u_{\mathrm{test}}$ of the base model. However, this can be adjusted more generally by any (potentially non-linear) function $f(\cdot)$. The best choice of $f(\cdot)$ depends on the unknown relation between $u_{\mathrm{test}}$ and the $(1-\alpha)$-quantile of the ground truth labels' conformal scores, denoted as\footnote{Recall that the data we face at test time is denoted as $\{x_i,y_i\}_{i\in[N]}$, with the observed part (covariates) denoted as $D_\mathrm{test}=\{x_i\}_{i\in[N]}$.}
\begin{equation*}
\tau_{\mathrm{test}}:=q_{1-\alpha}\left[s(x_i,y_i);x_i\in D_\mathrm{test}\right].
\end{equation*}
The optimal $f(\cdot)$ should satisfy $f(u_{\mathrm{test}})=\tau_D/\tau_{\mathrm{test}}$.
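One way to read this condition is as a matching of thresholds. As a sketch, assume $f(u_{\mathrm{test}})\ge 1$, so that the clipping $\max(1,\cdot)$ in Algorithm~\ref{alg:eacp} is inactive. The scaled rule is then equivalent to thresholding the unscaled scores at $\tau_D/f(u_{\mathrm{test}})$:
\begin{equation*}
s(x_i,y)\cdot f(u_{\mathrm{test}})\ge \tau_D \quad\Longleftrightarrow\quad s(x_i,y)\ge \frac{\tau_D}{f(u_{\mathrm{test}})}.
\end{equation*}
If the test labels were observed, the threshold calibrated directly on the test data would be $\tau_{\mathrm{test}}$. Setting $\tau_D/f(u_{\mathrm{test}})=\tau_{\mathrm{test}}$ and rearranging recovers $f(u_{\mathrm{test}})=\tau_D/\tau_{\mathrm{test}}$.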

While finding this optimal $f(\cdot)$ is obviously infeasible without observing the ground truth labels at test time, in Figure~\ref{fig:ent_tau_relation} we empirically evaluate the ideal choice in a post-hoc manner, across different datasets, to illustrate the underlying relation. Recall that both $u_{\mathrm{test}}$ and $\tau_{\mathrm{test}}$ depend on the desired error rate $\alpha$. Therefore, for each dataset, we vary $\alpha$ and plot the resulting $u_{\mathrm{test}}$ versus $\tau_D/\tau_{\mathrm{test}}$ on a log-log scale. If we mildly restrict $f(\cdot)$ to the family of polynomials, then its optimal order can be approximated by the slope of a linear fit on the log-log plot. We only use not-too-small $\alpha$ values (i.e., the lower left corner on the plot) for the linear fit, since they are closer to typical practice and less prone to noise. 
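The link between the slope and the polynomial order can be made explicit with a short calculation. As a simplifying assumption, suppose the ideal scaling is a monomial $f(u)=c\,u^{p}$; the constants $c>0$ and $p$ are introduced here only for illustration. Taking logarithms of $f(u_{\mathrm{test}})=\tau_D/\tau_{\mathrm{test}}$ gives
\begin{equation*}
\log\frac{\tau_D}{\tau_{\mathrm{test}}} = \log c + p\,\log u_{\mathrm{test}},
\end{equation*}
so the points on the log-log plot fall on a line with slope $p$. Under this assumption, a fitted slope close to $1$ corresponds to linear scaling, and a slope close to $2$ corresponds to quadratic scaling.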

Figure~\ref{fig:ent_tau_relation} shows that the optimal polynomial order generally increases with the severity of the distribution shift, which is consistent with the fact that a larger polynomial order would lead to larger prediction sets using our methods. While end-users can refine $f(\cdot)$ based on a preference towards ensuring coverage or small set sizes, we will empirically validate that our methods with either linear scaling (denoted by $\ecp_1$\ / $\eacp_1$) or quadratic scaling (denoted by $\ecp_2$\ / $\eacp_2$) perform well in a wide range of settings. 

\section{Experiments}\label{section:experiment}


We conduct experiments across a number of large-scale datasets and neural network architectures. Our setup builds on the standard \scp\ procedure introduced in Section~\ref{section:preliminary}, which relies on a held-out, in-distribution, ``development set'' for calibrating the CP threshold. On ImageNet variants, we split the original ImageNet development set (i.e., not used for model training) into a CP calibration set consisting of 25,000 samples, and an in-distribution test set (sometimes called the validation set in the CP literature). Readers are referred to Appendix~\ref{appendix:exp-dets} for more details.

The conformal threshold is found on the calibration set, and used in subsequent distribution-shifted settings. Importantly, \emph{after the conformal threshold is estimated in-distribution, all subsequent steps are unsupervised.} We show results on both stationary and continuously shifting test distributions. 

\paragraph{Baselines} We compare our proposed methods to the following baselines:
\begin{itemize}
\item \texttt{NAIVE}: generating prediction sets by including classes until their cumulative softmax score is greater than or equal to $1-\alpha$ (the target coverage level). This is generated by the base model itself, without the CP post-processing. 
\item Standard \scp: applying the CP threshold directly on the distribution-shifted data.
\item \scp\ with \texttt{ETA}: applying the CP threshold while updating the base model using \texttt{ETA}. 
\end{itemize}
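For concreteness, the \texttt{NAIVE} baseline can be written as follows. This is a sketch under the assumption, not stated explicitly above, that classes are added in decreasing order of softmax score; the ordering notation $y_{(1)},\ldots,y_{(k)}$ and the index $m$ are introduced here only for illustration. Let $\pi_\theta(x)_{y_{(1)}}\ge\cdots\ge\pi_\theta(x)_{y_{(k)}}$ denote the sorted softmax scores of an input $x$. Then
\begin{equation*}
\calC_{\mathrm{naive}}(x)=\{y_{(1)},\ldots,y_{(m)}\},\qquad m=\min\Big\{j\in[k]:\ \sum_{l=1}^{j}\pi_\theta(x)_{y_{(l)}}\ge 1-\alpha\Big\}.
\end{equation*}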

Furthermore, in settings with stationary distribution shifts, we compare to Robust Conformal (\texttt{RC}; \citealt{cauchois2024robust}), an existing CP algorithm that handles distribution shifts via robust optimization. In settings with continual distribution shifts, we compare to a number of OCO-based algorithms \citep{bhatnagar2023improved,gibbs2024conformal,zhang2024discounted} that require additional access to the ground truth labels. 

In all experiments, the target coverage rate is set to 0.90. We also analyze our methods with both linear and quadratic scaling, as described in Section \ref{scn:uncert-scaling}.

\begin{table*}[!ht]
    \caption{\ecp\ and \eacp\ can achieve very competitive empirical coverage rates on a number of distribution-shifted datasets, across a variety of imaging domains (ecological, cellular, satellite, etc.). All results are from ResNet-50 models except FMOW, which uses a DenseNet-121 \citep{Huang2016DenselyCC}. Quadratic uncertainty scaling provides better coverage rates, whereas linear scaling results in smaller set sizes.\label{tab:natural_shift}}
    \vspace{1mm}
    \centering
    \resizebox{0.7\textwidth}{!}{
    \begin{tabular}{llcccccc}
         & Method & ImageNet-V2 & ImageNet-R & ImageNet-A & iWildCam & RXRX1 & FMOW \\
        \hline
        \multirow[c]{10}{*}{Coverage}  & \scp & 0.81 & 0.50 & 0.03 & 0.84 & 0.84 & 0.87\\
        \addlinespace
        \addlinespace
        & \texttt{NAIVE} & 0.88 & 0.69 & 0.14 & 0.76 & 0.48 & 0.83 \\
        & \rcp & 0.88 & 0.63 & 0.14 & 0.99 & 0.91 & 0.93 \\
        & \texttt{ETA} & 0.81 & 0.62 & 0.05 & 0.84 & 0.87 & 0.87  \\ 
        \addlinespace
        \addlinespace
        &  $\ecp_1$ & 0.86 & 0.61 & 0.10 & 0.84 & 0.87 & 0.93 \\
        & $\ecp_2$ & \textbf{0.91} & 0.72& 0.27 & 0.88 & 0.90& \textbf{0.96}\\
        & $\eacp_1$ & 0.86& 0.71& 0.14 & 0.84 & 0.90& 0.93\\
        & $\eacp_2$ & \textbf{0.91} & \textbf{0.80} & \textbf{0.30} & \textbf{0.89} & \textbf{0.93} & 0.94\\
        \addlinespace
        \hline
        \addlinespace
        \multirow[c]{10}{*}{Set Size}  & \scp & 2.5& 3.4& 3.4& 3.9& 81.8& 6.2\\
        \addlinespace
        \addlinespace
        & \texttt{NAIVE} & 11.7 & 20.9& 12.7& 2.5& 6.4& 5.8 \\
        & \rcp & 5.5& 10.7& 9.6& 125& 166& 10.2  \\
        & \texttt{ETA} & 2.5& 3.0& 3.6& 3.8& 100&6.5 \\ 
        \addlinespace
        \addlinespace
        &  $\ecp_1$ & 4.2& 9.1& 7.4& 3.8& 105& 10.3\\
        & $\ecp_2$ & 7.6& 23.3 & 15.1& 5.5& 137& 15.3\\
        & $\eacp_1$ & 4.5& 6.8& 8.7 & 3.7& 133 & 11.1\\
        & $\eacp_2$ & 8.7& 16.1& 10.1& 5.6& 177& 16.4\\ 
        \hline
        
    \end{tabular}}
\end{table*}

\begin{table*}[!ht]
    \caption{Coverage on four different corruption types representing each ImageNet-C category. Compared to the baselines, $\ecp_2$ closes the coverage gap on most severity levels, while $\eacp_2$ further improves this by achieving the target coverage rate 0.90 on nearly all corruption types and severities.}
\centering
    \resizebox{\textwidth}{!}{

    \begin{tabular}{@{}lccccc|ccccc|ccccc|ccccc@{}}
       \multirow[c]{2}{*}{Method} & \multicolumn{5}{c}{Contrast}  &   \multicolumn{5}{c}{Brightness}    &  \multicolumn{5}{c}{Gaussian Noise}  &  \multicolumn{5}{c}{Motion Blur}  \\
    &   1 & 2 & 3 &4 & 5 & 1 & 2 & 3 &4 & 5& 1 & 2 & 3 &4 & 5& 1 & 2 & 3 &4 & 5  \\
    \hline
        \addlinespace


      \texttt{NAIVE}     & \cellcolor{green!25}0.91 & \cellcolor{yellow!25}0.89 & \cellcolor{yellow!25}0.87 & \cellcolor{yellow!25}0.83 & \cellcolor{orange!25}0.76 & \cellcolor{green!25}0.92 & \cellcolor{green!25}0.92 & \cellcolor{green!25}0.91 & \cellcolor{green!25}0.91 & \cellcolor{green!25}0.90 & \cellcolor{yellow!25}0.88 & \cellcolor{yellow!25}0.85 & \cellcolor{orange!25}0.79 & \cellcolor{orange!25}0.69 & \cellcolor{orange!25}0.79 & \cellcolor{green!25}0.91 & \cellcolor{green!25}0.90 & \cellcolor{yellow!25}0.85 & \cellcolor{orange!25}0.77 & \cellcolor{orange!25}0.71  \\
          \addlinespace


      \scp\ \citep{sadinle_least_2019}   & \cellcolor{yellow!25}0.83 & \cellcolor{orange!25}0.78 & \cellcolor{orange!25}0.66 & \cellcolor{red!25}0.36 &\cellcolor{red!25}0.09 & \cellcolor{yellow!25}0.88 & \cellcolor{yellow!25}0.87 & \cellcolor{yellow!25}0.86 &\cellcolor{yellow!25} 0.83 & \cellcolor{orange!25}0.78 & \cellcolor{orange!25}0.79 & \cellcolor{orange!25}0.69 & \cellcolor{red!25}0.50 & \cellcolor{red!25}0.26 & \cellcolor{red!25}0.07 & \cellcolor{yellow!25}0.83 & \cellcolor{orange!25}0.74 & \cellcolor{orange!25}0.57 & \cellcolor{red!25}0.37 & \cellcolor{red!25}0.27  \\
          \addlinespace


      \texttt{ETA} \citep{niu2022efficient}        & \cellcolor{yellow!25}0.87 & \cellcolor{yellow!25}0.86 & \cellcolor{yellow!25}0.84 & \cellcolor{orange!25}0.79 & \cellcolor{orange!25}0.63 & \cellcolor{yellow!25}0.88 & \cellcolor{yellow!25}0.88 & \cellcolor{yellow!25}0.87 & \cellcolor{yellow!25}0.86 & \cellcolor{yellow!25}0.84 & \cellcolor{yellow!25}0.86 & \cellcolor{yellow!25}0.82 & \cellcolor{orange!25}0.76 & \cellcolor{orange!25}0.69 & \cellcolor{orange!25}0.54 & \cellcolor{yellow!25}0.86 & \cellcolor{yellow!25}0.84 & \cellcolor{yellow!25}0.80 & \cellcolor{orange!25}0.73 & \cellcolor{orange!25}0.68  \\
          \addlinespace
          % \addlinespace
          
       % $\ecp_1$ (ours) & \cellcolor{green!25}0.90 & \cellcolor{yellow!25}0.87 & \cellcolor{yellow!25}0.80 & \cellcolor{orange!25}0.57 & \cellcolor{red!25}0.26 & \cellcolor{green!25}0.92 & \cellcolor{green!25}0.92 & \cellcolor{green!25}0.90 &\cellcolor{yellow!25}0.89 & \cellcolor{yellow!25}0.87 & \cellcolor{yellow!25}0.85 & \cellcolor{yellow!25}0.80 & \cellcolor{orange!25}0.66 & \cellcolor{orange!25}0.65 & \cellcolor{red!25}0.17 & \cellcolor{yellow!25}0.89 & \cellcolor{yellow!25}0.85 & \cellcolor{orange!25}0.73 & \cellcolor{orange!25}0.57 & \cellcolor{red!25}0.46  \\

       $\ecp_2$ (ours) & \cellcolor{green!25}0.93 & \cellcolor{green!25}0.92 & \cellcolor{yellow!25}0.89 & \cellcolor{orange!25}0.79 & \cellcolor{orange!25}0.60 & \cellcolor{green!25}0.94 & \cellcolor{green!25}0.94 & \cellcolor{green!25}0.94 &\cellcolor{green!25}0.93 & \cellcolor{green!25}0.92 & \cellcolor{green!25}0.92 & \cellcolor{yellow!25}0.88 & \cellcolor{yellow!25}0.80 & \cellcolor{yellow!25}0.86 & \cellcolor{orange!25}0.38 & \cellcolor{green!25}0.94 & \cellcolor{green!25}0.92 & \cellcolor{yellow!25}0.86 & \cellcolor{orange!25}0.75 & \cellcolor{orange!25}0.68  \\
          \addlinespace
          % \addlinespace
          % \addlinespace

          
      % $\eacp_1$ (ours) & \cellcolor{green!25}0.90 & \cellcolor{green!25}0.90 & \cellcolor{green!25}0.90 & \cellcolor{yellow!25}0.86 & \cellcolor{orange!25}0.77 & \cellcolor{green!25}0.91 & \cellcolor{green!25}0.91 & \cellcolor{green!25}0.91 &\cellcolor{green!25}0.90 & \cellcolor{yellow!25}0.89 & \cellcolor{green!25}0.90 & \cellcolor{yellow!25}0.88 & \cellcolor{yellow!25}0.85 & \cellcolor{green!25}0.90 & \cellcolor{orange!25}0.71 & \cellcolor{green!25}0.90 & \cellcolor{yellow!25}0.89 & \cellcolor{yellow!25}0.87 & \cellcolor{yellow!25}0.83 & \cellcolor{yellow!25}0.80  \\

      $\eacp_2$ (ours)    & \cellcolor{green!25}0.93 & \cellcolor{green!25}0.93 & \cellcolor{green!25}0.93 & \cellcolor{green!25}0.92 & \cellcolor{yellow!25}0.87 & \cellcolor{green!25}0.93 & \cellcolor{green!25}0.93 & \cellcolor{green!25}0.93 & \cellcolor{green!25}0.93 & \cellcolor{green!25}0.93 & \cellcolor{green!25}0.93 & \cellcolor{green!25}0.93 & \cellcolor{green!25}0.92 & \cellcolor{green!25}0.90 & \cellcolor{yellow!25}0.84 & \cellcolor{green!25}0.93 & \cellcolor{green!25}0.92 & \cellcolor{green!25}0.92 & \cellcolor{green!25}0.91 & \cellcolor{yellow!25}0.89  \\
          % \addlinespace


    \end{tabular}    
}
    \label{tab:in-c}
\end{table*}

\paragraph{Datasets}\label{sec:datasets}
We investigate a number of ImageNet \citep{deng2009imagenet} variants including: ImageNet-V2 \citep{Recht2019DoIC}, ImageNet-R \citep{hendrycks2021many}, and ImageNet-A \citep{hendrycks2021nae}. We also test our approach on datasets from the WILDS Benchmark \citep{wilds2021}, which represent in-the-wild distribution shifts across many real-world applications, including iWildCam (camera-trap images of animals), RXRX1 (cellular images), and FMOW (satellite images).

While the previous datasets present a single distribution shift, the ImageNet-C \citep{Hendrycks2018BenchmarkingNN} dataset allows us to investigate shifts across many types and severities. Specifically, ImageNet-C applies 19 visual corruptions to the ImageNet validation set across four corruption categories (noise, blur, weather, and digital), with five severity levels for each corruption. See Appendix \ref{appendix:dataset-details} for more details on the datasets.

% \subsection{Stationary shifts}



\paragraph{Stationary shifts} Table \ref{tab:natural_shift} summarizes our results on various natural distribution-shifted datasets. We observe that \scp\ (with or without TTA) can exhibit significant gaps with respect to the target coverage rate, whereas \ecp\ closes the gap quite effectively while maintaining meaningful set sizes. Coverage is further improved via \eacp, which also helps reduce set sizes on some datasets. In general, we also observe an improvement over \texttt{RC} and \texttt{NAIVE}: the linear scaling variant of our methods has coverage rates similar to those of these baselines, while the set sizes are typically smaller. 

Here we can see the trade-off between linear and quadratic uncertainty scaling. $\eacp_2$ consistently achieves higher coverage rates; however, this also leads to ``over-coverage'' on some datasets and thus larger sets. In contrast, $\eacp_1$ leads to lower coverage but also smaller set sizes. End-users can choose between the two based on their preference for more accurate or more efficient prediction sets. In subsequent experiments, we will focus on demonstrating the benefit of $\eacp_2$ on coverage, while noting that the observed set sizes are nonetheless practically useful and far from trivial.   




In Table \ref{tab:in-c}, we show fine-grained results on one corruption type for each ImageNet-C category, and across each severity level. Here we see the benefit of leveraging an uncertainty notion that can be directly minimized and refined on new test samples. Specifically, $\eacp_2$ is able to recover the target coverage rate on almost all corruption types and severities. 



\begin{figure*}[h]
    \centering
    \includegraphics[width=0.9\textwidth]{images/INC-avg-ecp.pdf}
        \caption{Our \textcolor{darkgreen}{$\eacp_2$} method is able to improve coverage using various neural network models and architectures, under a diverse range of distribution shifts. It consistently ``hugs'' the desired coverage rate, while maintaining practical set sizes. Results are averaged across five severity levels for each corruption type in the ImageNet-C dataset. We zoom in on the \textbf{right} to clearly show the benefit of adapting at test time. Larger markers reflect a larger neural network parameter count. }
        \label{fig:inc-sev5}
\end{figure*}

\begin{table*}[h!]
    \centering
    \caption{We evaluate performance on the challenging setting of continuously shifting distributions. The ``label free'' column denotes whether a method relies on labels at test-time from the target data. We recall that \scp\ does not adapt to new data. In addition to the \textbf{average coverage ($\uparrow$)}  and \textbf{average size ($\downarrow$)}, we also measure the worst local coverage error \textbf{LCE\textsubscript{128} ($\downarrow$)} and worst local set size \textbf{LSS\textsubscript{128} ($\downarrow$)} on a sliding window of 128 test points.}
    \vspace{1mm}
    \resizebox{0.9\textwidth}{!}{
    \small
    \begin{tabular}{c@{\hskip 0.1in}l@{\hskip 0.1in}c@{\hskip 0.1in}c@{\hskip 0.1in}c@{\hskip 0.1in}c@{\hskip 0.2in}c@{\hskip 0.1in}c@{\hskip 0.1in}c@{\hskip 0.1in}c}
         & & \multicolumn{4}{c}{Gradual shift} & \multicolumn{4}{c}{Sudden shift} \\
         \cmidrule(lr){3-6} \cmidrule(lr){7-10}
         Label Free & Method & Avg. Cov & Avg. Size & LCE\textsubscript{128}  & LSS\textsubscript{128} & Avg. Cov & Avg. Size & LCE\textsubscript{128}  & LSS\textsubscript{128} \\
         \hline
        \addlinespace
          - &  \scp\ \citep{sadinle_least_2019} & 0.59 & 3.1 & 0.70 & 3.6 & 0.59 & 2.8 & 0.71 & 3.5 \\
          \addlinespace
          \xmark  & \texttt{SAOCP} \citep{bhatnagar2023improved} & 0.79 & 145 & 0.24 & 353 & 0.78 & 139 & 0.28 & 349 \\
         \xmark & \texttt{DtACI} \citep{gibbs2024conformal} & 0.90 & 101 & 0.07 & 455 & 0.90 & 142 & 0.09 & 450 \\
         \xmark & \texttt{MAGL-D} \citep{zhang2024discounted} & 0.90 & 403 & 0.05 & 856 & 0.90 & 355 & 0.05 & 844 \\
          \xmark& \texttt{MAGL} \citep{zhang2024discounted} & 0.90 & 117 & 0.06 & 573 & 0.90 & 168 & 0.3 & 704 \\
         \xmark & \texttt{MAGDIS} \citep{zhang2024discounted} & 0.90 & 417 & 0.06 & 841 & 0.90 & 372 & 0.07 & 852 \\
           \addlinespace
         \cmark & \texttt{ETA} \citep{niu2022efficient} & 0.69 & 2.9 & 0.52 & 3.4 & 0.67 & 2.7 & 0.54 & 3.5 \\
         \cmark & $\ecp_2$ (ours) & 0.84 & 36.6 & 0.35 & 90.4 & 0.82 & 37.5 & 0.38 & 88.5 \\
         \cmark & $\eacp_2$ (ours) & 0.88 & 22.4 & 0.20 & 47.8 & 0.86 & 23.1 & 0.28 & 55.7 \\
                 \hline

    \end{tabular}
    }
    \label{tab:inc-stream}
\end{table*}



Next, Figure \ref{fig:inc-sev5} contains the results using neural networks of various architectures and parameter counts, on all 19 corruption types of ImageNet-C (averaged across five severity levels). Besides showing the superior performance of our methods, we observe that the \scp\ baseline (with and without TTA) generates prediction sets with little variance in the set sizes, regardless of the achieved coverage rates. We argue that this is an undesirable behavior, as the set sizes themselves are often used to encode uncertainty evaluations by set-valued classifiers. Our results demonstrate that explicitly incorporating the base model's own uncertainty into CP can help mitigate this issue. 

% \subsection{Continuous shifts}

\paragraph{Continuous shifts} Finally, we investigate continuous distribution shifts, and the results are shown in Table \ref{tab:inc-stream}. This has been previously studied under \emph{online conformal prediction}, and we build on the experimental setup of \citet{saocp,zhang2024discounted}. Specifically, the environment shifts between ImageNet-C severity level 1 and level 5 (either suddenly or gradually; see Figure \ref{fig:cont-shifts-full}), while sampling random corruptions at each corresponding severity. We emphasize that this is a particularly challenging task, as it presents a continuous shift in both the magnitude and the type of corruption. See Appendix \ref{appendix:exp-prot} for more details on this experiment. Here, we compare with existing supervised methods that rely on the correct label being revealed after every prediction. 




Overall, our methods demonstrate competitive performance with respect to supervised baselines: the average set sizes are significantly smaller despite a slight drop in the average coverage rate. We also measure the local coverage error LCE\textsubscript{128} across the worst sliding window of 128 samples, and similarly the worst local set size, LSS\textsubscript{128}. While the supervised methods unsurprisingly result in better local coverage, they also lead to local set sizes that are much larger.
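The two local metrics can be formalized as follows. This is one plausible reading, assuming the absolute-deviation convention commonly used in the online conformal prediction literature; the window start index $t$ and the window length $w$ are notation introduced here. With $N$ test points and $w=128$,
\begin{equation*}
\mathrm{LCE}_{w}=\max_{1\le t\le N-w+1}\left|(1-\alpha)-\frac{1}{w}\sum_{i=t}^{t+w-1}\mathbf{1}\{y_i\in\calC(x_i)\}\right|,
\end{equation*}
\begin{equation*}
\mathrm{LSS}_{w}=\max_{1\le t\le N-w+1}\frac{1}{w}\sum_{i=t}^{t+w-1}|\calC(x_i)|.
\end{equation*}
Under this reading, a method whose coverage drops to $0.20$ on its worst window, with a target coverage rate of $0.90$, would have $\mathrm{LCE}_{128}=|0.90-0.20|=0.70$.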

\section{Conclusion}
This paper studies how to improve set-valued classification methods on distribution-shifted data, without relying on labels from the target dataset. This is an important challenge in many real-world settings, where exchangeability assumptions are violated and labels may be difficult to obtain. We propose an uncertainty-aware method based on the prediction entropy (\ecp), and leverage unsupervised test time adaptation to update the base model and refine its uncertainty (\eacp). We demonstrate that the proposed methods are able to recover the desired error rate on a wide range of distribution shifts, while maintaining efficient set sizes. Furthermore, they are even competitive with supervised approaches on challenging and continuously shifting distributions. We hope this inspires future work that continues to tackle this important challenge. 

\begin{acknowledgements}

We appreciate the feedback from anonymous reviewers. ZZ and HY are partially funded by Harvard University Dean’s Competitive Fund for Promising Scholarship. KK and GWT acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Ontario Ministry of Economic Development and Innovation under the Ontario Research Fund–Research Excellence (ORF-RE) program. Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and \href{https://vectorinstitute.ai/\#partners/}{companies sponsoring} the Vector Institute. 

\end{acknowledgements}

\bibliography{main}
\bibliographystyle{plainnat}


\newpage

\onecolumn

\title{Adapting Prediction Sets to Distribution Shifts Without Labels\\(Supplementary Material)}
\maketitle

% \section*{Appendix}
\appendix

In Appendix \ref{appendix:additional_related}, we provide an extensive overview of additional related works. Further, Appendix \ref{appendix:exp-dets} contains detailed information on our studied datasets and experimental protocols, including TTA hyper-parameters, CP procedure, and the setup for continual distribution shifts. Appendix \ref{scn: right-uncert} discusses other possible uncertainty measures and their deficiencies. Appendix~\ref{appendix:additional_experiment} contains additional experiment results. 

% Appendix \ref{appendix:other-tta} presents results with another base TTA method, Appendix \ref{appendix:arch-comp} has results of more neural network architectures on natural distribution shifts, Appendix \ref{appendix:cont-shift} has graphs performance on continuous-shifts at each time-step, and Appendix \ref{appendix:inc-all-sev} has full ImageNet-C results on each severity level.   

\section{Additional related works}\label{appendix:additional_related}

\paragraph{CP in decision making} Our interest in the considered setting -- distribution shifts without test time labels -- is mainly motivated by the growing applications of CP in autonomous decision making. A far from complete list: see \citep{lekeufack2023conformal} for a generic treatment; \citep{lindemann2023safe} for trajectory optimization in robotics; \citep{yang23cvpr-object,gao24arxiv-closure} for 3D vision; \citep{kumar2023conformal,cherian2024large,gui2024conformal,mohri2024language,quach2024conformal} for large language models (LLMs); and \citep{knowno2023} for LLM-powered robotics. 

\paragraph{CP under distribution shifts} As discussed in the main paper, considerable efforts have been devoted to developing CP methods robust to distribution shifts. We now survey two possible directions and their respective limitations. 
\begin{itemize}
\item The first direction does not require test time labels, but the distribution shift is assumed to be simple in some sense. For example, \citet{tibshirani2019conformal} studied CP under \emph{covariate shifts}, where the distribution of the label $y$ conditioned on the covariate $x$ remains unchanged. Here, it suffices to use the classical \emph{likelihood ratio reweighting} on the calibration dataset, but accurately estimating the likelihood ratio can be challenging in practice. Another idea is to take a robust optimization perspective by assuming a certain maximum level of distribution shift and protecting against the worst case, e.g., \citep[Chapter~8]{roth2022uncertain} and \citep{cauchois2024robust}. The weakness here is the sensitivity to the hyperparameter, and the obtained prediction sets could be overly conservative. 

Various works built on these two ideas. \citet{barber2023conformal} generalized the reweighting idea to handle mild but general distribution shifts, but choosing the weights is generally unclear in practice. \citet{ai2024not} tackled general distribution shifts by combining reweighting and robust optimization, which also combines the strengths and limitations from the two sides. \citet{ge2024optimal} extended the two ideas to the aggregation of multiple CP algorithms. 

\item The second direction connects CP to adversarial online learning. A line of works \citep{gibbs2021adaptive,angelopoulos2023conformal,gibbs2024conformal, saocp, zhang2024discounted} applied regret minimization algorithms in OCO to select the score threshold in CP, and \citet{bastani2022practical} achieved this task using \emph{multicalibration}. By relaxing the CP objective from the \emph{coverage probability} to the \emph{post-hoc coverage frequency}, these methods can handle arbitrary continual distribution shifts. However, they require the true label to be provided after every prediction, which is a limiting requirement for many use cases in autonomous decision making. Our experiments will show that it is possible to achieve comparable performance in these settings without this limitation, i.e., being ``label free''.
\end{itemize}

\section{Experimental details}\label{appendix:exp-dets}

\subsection{Dataset details}\label{appendix:dataset-details}
We perform experiments on a number of large-scale datasets that are frequently used to evaluate deep learning performance under distribution shift \citep{wilds2021, wang2021tent, minderer2021revisiting, niu2022efficient,zhang2022memo, bhatnagar2023improved, zhang2024discounted}:
\begin{itemize}
    \item \textbf{ImageNet-V2} \citep{Recht2019DoIC} is an ImageNet test set of 10,000 images that were collected by closely following the original ImageNet data collection process. 
    \item \textbf{ImageNet-R} \citep{hendrycks2021many} includes renditions (e.g., paintings, sculptures, drawings, etc.) of 200 ImageNet classes, resulting in a test set of 30,000 images. 
    \item \textbf{ImageNet-A} \citep{hendrycks2021nae} consists of 7,500 real-world, unmodified, and naturally occurring adversarial images which a ResNet-50 model failed to correctly classify. 
    \item \textbf{ImageNet-C} \citep{Hendrycks2018BenchmarkingNN} applies 19 visual corruptions across four categories and at five severity levels to the original ImageNet validation set.
    \item \textbf{iWildCam} \citep{wilds2021, beery2020iwildcam} contains camera-trap images from different areas of the world, representing geographic distribution shift. It includes a validation set of 7,314 images from the same camera traps the model was trained on, which is used as our calibration data, as well as 42,791 images from different camera traps that are used as our test set. Each image contains one of 182 possible animal species. 
    \item \textbf{RXRX1} \citep{wilds2021, taylor2019rxrx1} consists of high resolution fluorescent microscopy images of human cells which have been given one of 1,139 genetic treatments, with the goal of generalizing across experimental batches. It is split into an in-distribution validation set of 40,612 images and a test set of 34,432 images.
    \item \textbf{FMOW} \citep{wilds2021, christie2018functional} is a satellite imaging dataset with the goal of classifying images into one of 62 different land use or building types. It consists of 11,483 validation images from the years 2002--2013, and 22,108 test images from the years 2016--2018.
    \end{itemize}

\subsection{Experimental protocols}\label{appendix:exp-prot}

\paragraph{Conformal prediction} Our split conformal prediction setup follows previous works \citep{angelopoulos2021uncertainty, angelopoulos2022gentle}, which divide a held-out dataset into a calibration and test set. On ImageNet variants, we split the original validation set in half to produce 25,000 calibration points and 25,000 in-distribution test points. The calibrated scores and / or threshold are then used for subsequent distribution-shifted data. On the WILDS datasets, we similarly split the in-distribution validation sets. 

\paragraph{Adaptation procedure} Our ImageNet-based experiments are conducted on pre-trained ResNets provided by the \textit{torchvision} library\footnote{\url{https://github.com/pytorch/vision}}, and ViTs provided by the \textit{timm} library\footnote{\url{https://github.com/huggingface/pytorch-image-models}}. Experiments on WILDS datasets are conducted using pre-trained models provided by the authors of that study\footnote{\url{https://github.com/p-lambda/wilds}}. For \eacp\ and \texttt{ETA}, we closely follow the optimization hyperparameters from the original paper \citep{niu2022efficient}: we use the SGD optimizer with a momentum of 0.9 and a learning rate of 0.00025. We use a batch size of 64 for all ImageNet experiments, 128 for RXRX1 and FMOW, and 42 for iWildCam. Our experiments are conducted using a single NVIDIA A40 GPU. 

\paragraph{Continuous shift} We adopt a slightly modified version of the experimental design for continuous distribution shift presented in previous works \citep{bhatnagar2023improved, zhang2024discounted}. This involves sampling random corruptions from the ImageNet-C dataset under two regimes: \textbf{gradual shifts}, where the severity level first increases in order through $\{1,\ldots,5\}$ and then decreases through $\{5,\ldots,1\}$, and \textbf{sudden shifts}, where the severity level alternates between 1 and 5. In addition to sampling random corruptions, Figure \ref{fig:cont-shifts-full} also shows results on the ``easier'' setting of shifting severities on a single corruption type.    


% \section{Algorithm Pseudo-code}\label{appendix:pseudocode}



\section{What is the right measure of uncertainty?}\label{scn: right-uncert}

\begin{figure}[ht]
    \centering
    % \begin{subfigure}[b]{0.3\textwidth}
    %     \centering
    %     \includegraphics[width=\textwidth]{images/ent_vs_smx.pdf}
    %     \caption{Prediction entropy}
    %     % \label{fig:ent_vs_smx}
    % \end{subfigure}
    % \hspace{0.02\textwidth}
    \begin{subfigure}[b]{0.45\textwidth}
        \centering
        \includegraphics[width=\textwidth]{images/var_vs_smx.pdf}
        \caption{Softmax variance}
        \label{fig:var_vs_smx}
    \end{subfigure}
    \hspace{0.02\textwidth}
    \begin{subfigure}[b]{0.45\textwidth}
        \centering
        \includegraphics[width=\textwidth]{images/abs_vs_smx.pdf}
        \caption{$1 -$ maximum softmax}
        \label{fig:max_vs_smx}
    \end{subfigure}
    \caption{Similarly to Figure \ref{fig:ent_vs_smx}, we present the relation between different uncertainty measures and the average score of the true label. We see that softmax variance (left) has an inverse relation with distribution shift, and $1-$ maximum softmax is a bounded metric that may provide an insufficient adjustment.}
    \label{fig:uncert_vs_smx}
\end{figure}
Although in Section \ref{scn:scaling-uncert} we propose adjusting the conformal scores by the prediction entropy of the base model, it is worth asking whether other notions of uncertainty could be used instead. Here, we consider two additional uncertainty measures and their relation with the softmax value of the true label (the label we ultimately want our prediction set to include), and show that they are ill-suited for our task. First, in Figure \ref{fig:var_vs_smx} we consider the variance of the softmax scores. Perhaps surprisingly, we see that distribution shift most often leads to a \emph{smaller} variance, thus conveying that the base model is \emph{less} uncertain. This suggests that softmax variance is a deficient uncertainty measure, as it fails to capture the actual underlying uncertainty on distribution-shifted data.


We also consider $1-$ maximum softmax score as another possible uncertainty measure, and see in Figure \ref{fig:max_vs_smx} that distribution shift is associated with smaller maximum softmax values. Unlike softmax variance, this measure does appear to capture the uncertainty, as we would expect the base model to be less confident on distribution-shifted data. However, it is bounded above by one and thus may not provide the necessary adjustment magnitude, and it is unknown whether it can be used to reliably update the base model without labels.
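
To make the boundedness argument concrete, let $\pi(x)$ denote the softmax output over $K$ classes (notation introduced only for this illustration). The two measures have the ranges
\[
    0 \le 1 - \max_{k} \pi_k(x) \le 1 - \frac{1}{K},
    \qquad
    0 \le H(\pi(x)) = -\sum_{k=1}^{K} \pi_k(x) \log \pi_k(x) \le \log K,
\]
with both upper bounds attained by the uniform distribution. For ImageNet with $K = 1000$, and assuming the natural logarithm, these bounds are roughly $0.999$ and $6.91$ respectively, so the entropy leaves considerably more room for adjusting the scores.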

While there may exist better uncertainty measures that future works can explore, these results suggest that the prediction entropy is a simple and reliable measure for conformal adjustments that can effectively capture the underlying uncertainty. 

\section{Additional experiments}\label{appendix:additional_experiment}

\subsection{Other TTA methods}\label{appendix:other-tta}

We investigate our methods' performance with another base TTA method in Table \ref{tab:tent-shift}. Here, we use the \texttt{Tent} update \citep{wang2021tent}, which is a simpler version of \texttt{ETA} with no re-weighting of the entropy loss. While our proposed methods are also compatible with \texttt{Tent}, we notice that the more powerful \texttt{ETA} leads to better coverage and set sizes, as seen in Table \ref{tab:natural_shift}. We can expect that additional improvements in TTA will similarly lead to improvements in our \eacp\ method.

\begin{table*}
    \caption{Our proposed \eacp\ performs well with other TTA methods, as seen here using \texttt{Tent} \citep{wang2021tent} as the TTA update.}

    \centering
    % \resizebox{\textwidth}{!}{
    \begin{tabular}{lcccccc}
        \multirow[c]{2}{*}{\textbf{Dataset}} & \multicolumn{1}{c}{\textbf{\scp}} & \multicolumn{1}{c}{\textbf{\texttt{Tent}}} & \multicolumn{1}{c}{\textbf{$\ecp_2$}} & \multicolumn{1}{c}{\textbf{$\eacp_2$}}  \\
         & (coverage / set size) & (coverage / set size) & (coverage / set size) &  (coverage / set size) \\
        \hline
        \addlinespace

        ImageNet-V2 & 0.81 / 2.5  & 0.81 / 2.6 & 0.91 / 8.0 & 0.92 / 9.6  \\
         ImageNet-R & 0.50 / 3.2 & 0.58 / 3.3 & 0.73 / 23 & 0.77 / 17  \\
         ImageNet-A & 0.07 / 1.5 & 0.21 / 3.1 & 0.58 / 204  & 0.40 / 24  \\
         \addlinespace
         iWildCam & 0.83 / 3.5 & 0.81 / 2.6 & 0.89 / 5.7 & 0.85 / 3.4  \\
         RXRX1 & 0.85 / 83 & 0.87 / 101 & 0.90 / 136.7 & 0.92 / 176  \\
         FMOW & 0.87 / 6.3 & 0.85 / 5.7 & 0.96 / 15.6 & 0.94 / 13.4  \\
         \hline
    \end{tabular}
    % }
    \label{tab:tent-shift}
\end{table*}

\subsection{More architecture comparisons}\label{appendix:arch-comp}

In Table \ref{tab:arch-comp-natshift}, we further demonstrate our methods' improvements to coverage loss on natural distribution shifts using diverse neural network architectures. As expected, larger and more accurate neural networks result in better coverage and smaller set sizes using \ecp\ and \eacp. This is encouraging, as it demonstrates that our methods can scale along with the underlying model.

\begin{table*}[h]
    \caption{On natural distribution shifts, the performance of our methods scales well with the performance of the base classifier. This is encouraging, as it suggests that our methods can scale along with the underlying model.}

    \centering    
    \resizebox{\textwidth}{!}{

    \begin{tabular}{llcccccc}
        \multirow[c]{2}{*}{\textbf{Dataset}} & \multirow[c]{2}{*}{\textbf{Model}} &  \multicolumn{1}{c}{\textbf{\scp}} & \multicolumn{1}{c}{\textbf{\texttt{ETA}}} & \multicolumn{1}{c}{\textbf{$\ecp_2$}} & \multicolumn{1}{c}{\textbf{$\eacp_2$}}  \\
         & & (coverage / set size) & (coverage / set size) & (coverage / set size) &  (coverage / set size) \\
        \hline
        \addlinespace

         \multirow[c]{4}{*}{\textbf{ImageNet-V2}} & ResNet50 & 0.81 / 2.5  & 0.81 / 2.5 & 0.91 / 7.6 & 0.91 / 8.7  \\
         & ResNet152 & 0.81 / 2.0  & 0.81 / 2.1 & 0.89 / 4.6 & 0.91 / 6.3  \\
         & ViT-S & 0.80 / 1.5  & 0.80 / 1.5 & 0.90 / 3.4 & 0.90 / 3.4  \\
         & ViT-B & 0.80 / 1.2  & 0.80 / 1.2 & 0.90 / 2.4 & 0.90 / 2.4  \\
         
         \addlinespace
         
         \multirow[c]{4}{*}{\textbf{ImageNet-R}} & ResNet50 & 0.50 / 3.4 & 0.62 / 3.0 & 0.72 / 23.3 & 0.80 / 16.1  \\
         & ResNet152 & 0.53 / 2.7  & 0.60 / 2.6 & 0.71 / 15.3 & 0.79 / 17.3  \\
         & ViT-S & 0.52 / 1.3  & 0.53 / 1.3 & 0.74 / 12.3 & 0.75 / 11.8  \\
         & ViT-B & 0.58 / 0.9  & 0.59 / 0.9 & 0.78 / 8.3 & 0.79 / 8.0  \\
         
\addlinespace

         \multirow[c]{4}{*}{\textbf{ImageNet-A}} & ResNet50 & 0.03 / 3.4 & 0.05 / 3.6 & 0.27 / 15.1  & 0.30 / 19.1  \\
         & ResNet152 & 0.18 / 3.0  & 0.17 / 3.3 & 0.43 / 11.8 & 0.50 / 19.6  \\
         & ViT-S & 0.37 / 1.7  & 0.37 / 1.7 & 0.65 / 8.4 & 0.66 / 8.3  \\
         & ViT-B & 0.47 / 1.2  & 0.47 / 1.2 & 0.76 / 6.5 & 0.76 / 6.4  \\

         \hline
    \end{tabular}}
    \label{tab:arch-comp-natshift}
\end{table*}

\subsection{Continuous shifts}\label{appendix:cont-shift}

In Figure \ref{fig:cont-shifts-full}, we visualize the coverage and set sizes of our unsupervised methods and a number of supervised baselines on the previously described continuous distribution shifts. We show results on random corruption types as well as fixed corruption types. Our proposed methods perform well across all these settings; they closely maintain coverage even on sudden and severe shifts, while leading to substantially smaller set sizes than the baselines.


\begin{figure*}[h]
    \centering
    \begin{subfigure}{0.7\textwidth}
        \centering
        \includegraphics[width=\textwidth]{images/inc_stream_mixed.pdf}
        \caption{Shifting (random) corruptions}
    \end{subfigure}
    \vfill
    \begin{subfigure}{0.7\textwidth}
        \centering
        \includegraphics[width=\textwidth]{images/inc_stream_contrast.pdf}
        \caption{Contrast corruption}
    \end{subfigure}
    \vfill
    \begin{subfigure}{0.7\textwidth}
        \centering
        \includegraphics[width=\textwidth]{images/inc_stream_brightness.pdf}
        \caption{Brightness corruption}
    \end{subfigure}
    % \caption{Main caption for the figure (continued on next page)}
\end{figure*}

\begin{figure*}[h]\ContinuedFloat
    \centering
    \begin{subfigure}{0.7\textwidth}
        \centering
        \includegraphics[width=\textwidth]{images/inc_stream_gaussian_noise.pdf}
        \caption{Gaussian noise corruption}
    \end{subfigure}
    \vfill
    \begin{subfigure}{0.7\textwidth}
        \centering
        \includegraphics[width=\textwidth]{images/inc_stream_motion_blur.pdf}
        \caption{Motion blur corruption}
    \end{subfigure}
    \caption{Our unsupervised methods \ecp\ and \eacp\ provide nearly the same empirical coverage as supervised methods, with considerably smaller set sizes, on continuously shifting distributions. Dashed lines denote methods that rely on a ground truth label being revealed at test time.}
    \label{fig:cont-shifts-full}
\end{figure*}


\subsection{ImageNet-C all severity levels}\label{appendix:inc-all-sev}

In Figure \ref{fig:inc-all-sevs}, we present full results across all ImageNet-C severity levels. We see that our method is effective in recovering coverage even under many highly severe distribution shifts, and nearly always recovers the desired coverage on less severe shifts.   

\begin{figure*}[h]
    \centering
    \begin{minipage}{0.4\textwidth}
        \centering
        \begin{subfigure}[b]{\linewidth}
            \includegraphics[width=\linewidth]{images/INC-sev1.pdf}
            \caption{Severity level 1}
            \label{fig:sev1}
        \end{subfigure}
    \end{minipage}
    \hfill
    \begin{minipage}{0.4\textwidth}
        \centering
        \begin{subfigure}[b]{\linewidth}
            \includegraphics[width=\linewidth]{images/INC-sev2.pdf}
            \caption{Severity level 2}
            \label{fig:sev2}
        \end{subfigure}
    \end{minipage}

    \vspace{0.1cm}

    \begin{minipage}{0.4\textwidth}
        \centering
        \begin{subfigure}[b]{\linewidth}
            \includegraphics[width=\linewidth]{images/INC-sev3.pdf}
            \caption{Severity level 3}
            \label{fig:sev3}
        \end{subfigure}
    \end{minipage}
    \hfill
    \begin{minipage}{0.4\textwidth}
        \centering
        \begin{subfigure}[b]{\linewidth}
            \includegraphics[width=\linewidth]{images/INC-sev4.pdf}
            \caption{Severity level 4}
            \label{fig:sev4}
        \end{subfigure}
    \end{minipage}

    \vspace{0.1cm}

    \begin{minipage}{0.4\textwidth}
        \centering
        \begin{subfigure}[b]{\linewidth}
            \includegraphics[width=\linewidth]{images/INC-sev5.pdf}
            \caption{Severity level 5}
            \label{fig:sev5}
        \end{subfigure}
    \end{minipage}

    \caption{Performance on 19 ImageNet-C corruptions on each severity level. \textcolor{darkgreen}{$\eacp_2$} hugs the desired coverage line on nearly all severity levels. Larger markers indicate larger parameter count.}
    \label{fig:inc-all-sevs}
\end{figure*}


\subsection{Oracle results}

Here we compare our methods with an oracle that has observed labels from the distribution-shifted dataset. Specifically, the oracle is the THR conformal prediction method \citep{sadinle_least_2019} calibrated on half of the distribution-shifted dataset, following regular split conformal prediction. Since the oracle is guaranteed to provide the desired coverage level in this set-up, our comparison focuses on the prediction set sizes; we refer to the main paper for coverage comparisons. We observe in Table \ref{tab:nat} that in every case except FMOW, a variant of \ecp\ or \eacp\ achieves smaller set sizes than the oracle. In Table \ref{tab:syn}, \eacp\ achieves smaller set sizes than the oracle in most ImageNet-C settings, often substantially so, while also recovering error targets (see Table \ref{tab:in-c}). We reiterate here that smaller sets are preferred if error rates are maintained.

% In Tables \ref{tab:nat} and \ref{tab:syn} we observe that $\eacp$ often returns sets \textit{substantially smaller} than this oracle, due to the entropy adaptation.  

\begin{table*}[ht]
    \caption{\ecp\ and \eacp\ achieve prediction set sizes that are often equal to or smaller than those of the oracle method. The target coverage rate is 0.90.}
    \vspace{1mm}
    \centering
    % \resizebox{\textwidth}{!}{
    \begin{tabular}{llcccccc}
         & Method & ImageNet-V2 & ImageNet-R & ImageNet-A & iWildCam & RXRX1 & FMOW \\
        \hline
        \addlinespace
        \multirow[c]{6}{*}{Set Size}  & \texttt{ORACLE} & 6.8 & 79.0 & 95.3 & 6.6 & 140 & 7.87 \\
        \addlinespace
        &  $\ecp_1$ & 4.2& 9.1& 7.4& 3.8& 105& 10.3\\
        & $\ecp_2$ & 7.6& 23.3 & 15.1& 5.5& 137& 15.3\\
        \addlinespace

        & $\eacp_1$ & 4.5& 6.8& 8.7 & 3.7& 133 & 11.1\\
        & $\eacp_2$ & 8.7& 16.1& 10.1& 5.6& 177& 16.4\\ 
        \hline
        
    \end{tabular}
    % }
    \label{tab:nat}
\end{table*}


\begin{table*}[h]
    \caption{Comparison of \ecp\ and \eacp\ on a subset of synthetic shifts. The numbers refer to severity level.}
    \centering
        \begin{tabular}{@{}lcccc|ccc|ccc|ccc@{}}
           & \multirow[c]{2}{*}{Method} & \multicolumn{3}{c}{Contrast}  &   \multicolumn{3}{c}{Brightness}  &  \multicolumn{3}{c}{Gaussian Noise}  &  \multicolumn{3}{c}{Motion Blur}  \\
         &  & 1 & 3 & 5 & 1 & 3 & 5 & 1 & 3 & 5 & 1 & 3 & 5 \\
         \hline
         & \texttt{ORACLE} & 5.5 & 30.3 & 562 & 2.5 & 3.7 & 9.8 & 6.2 & 70.6 & 317 & 9.7 & 101 & 638  \\
        \addlinespace

         % & $\ecp_1$ & 5.0 & 9.5 & 19\\
         \multirow[c]{1}{*}{Set Size} & $\ecp_2$ & 10.5 & 27.8 & 180 & 5.3 & 7.7 & 14.9 & 10.0 & 43.1 & 109 & 12.9 & 43.7 & 79.0 \\
         \addlinespace

         % & $\eacp_1$ & 3.3 & 4.1 \\
         & $\eacp_2$ & 5.5 & 7.4 & 25 & 4.5 & 5.7 & 7.6 & 5.7 & 16.0 & 42.7 & 6.3 & 12.8 & 25.5 \\
        \hline

        \end{tabular}   \label{tab:syn}
    
    % \label{tab:in-c}
\end{table*}

\subsection{In-distribution results}
In Table \ref{tab:ID-results}, we observe that our methods maintain coverage and reasonable set sizes on in-distribution data.

\begin{table*}[h]
    \caption{Results on in-distribution data using the ImageNet-1k validation set.}
    \centering
        \begin{tabular}{@{}lccccc@{}}
          & $\scp$ & $\ecp_1$ & $\ecp_2$ & $\eacp_1$ & $\eacp_2$\\
         \hline
        Coverage & 0.90 & 0.92 & 0.94 & 0.91 & 0.93 \\
        Set size & 2.1 & 2.8 & 4.2 & 2.8 & 4.1\\
        \end{tabular}
        \label{tab:ID-results}
    
    % \label{tab:in-c}
\end{table*}

\subsection{Comparison with weighted CP}

\citet{tibshirani2019conformal} present weighted conformal prediction ($\texttt{wcp}$), a method for improving coverage under covariate shift by re-weighting calibration scores based on an estimated likelihood ratio. Although estimating likelihood ratios in our setting is challenging, we nonetheless present a comparison here for completeness. We follow their approach and train a probabilistic classifier, here a CNN, on each calibration-test pair to estimate this ratio.
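
The re-weighting step can be sketched as follows, in notation introduced only for this illustration. If $\hat{p}(x)$ denotes the classifier's estimated probability that an input $x$ comes from the test distribution, the likelihood ratio can be estimated, up to a constant that cancels after normalization, by
\[
    \hat{w}(x) = \frac{\hat{p}(x)}{1 - \hat{p}(x)}.
\]
For a test input $x$ and calibration inputs $X_1,\ldots,X_n$, the score of calibration point $i$ then receives the normalized weight
\[
    p_i^{w}(x) = \frac{\hat{w}(X_i)}{\sum_{j=1}^{n} \hat{w}(X_j) + \hat{w}(x)},
\]
with the remaining mass placed on $+\infty$. The threshold is taken as the $(1-\alpha)$ quantile of this weighted score distribution rather than of the uniformly weighted one, so its quality depends directly on the accuracy of $\hat{p}$.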

Table \ref{tab:wcp-comp} suggests that this method may have limited performance in our studied setting. This may be due to the challenge of estimating accurate likelihood ratios in high-dimensional settings \citep{cauchois2024robust}. We do not claim that $\texttt{wcp}$ definitely \textit{cannot} perform well; however, the sparsity of previous literature here suggests that further studies may be required. Finally, note that $\texttt{wcp}$ is ill-suited for the case of continuously shifting distributions, further limiting its general applicability.

\begin{table*}[ht]
    \caption{The $\texttt{wcp}$ method appears to provide minimal coverage improvements in this setting, possibly due to the difficulty in estimating likelihood ratios.\label{tab:wcp-comp}}
    \vspace{1mm}
    \centering
    % \resizebox{\textwidth}{!}{
    \begin{tabular}{llcccccc}
         & Method & ImageNet-V2 & ImageNet-R & ImageNet-A  \\
        \hline
        \multirow[c]{8}{*}{Coverage}  & \scp & 0.81 & 0.50 & 0.03 \\
        \addlinespace
        \addlinespace
        & \texttt{wcp} & 0.82 & 0.35 & 0.06   \\ 
        \addlinespace
        \addlinespace
        & $\ecp_2$ & 0.91 & 0.72& 0.27 \\
        & $\eacp_2$ & 0.91 & 0.80 & 0.30 \\
        \addlinespace
        \hline
        \addlinespace
        \multirow[c]{8}{*}{Set Size}  & \scp & 2.5& 3.4& 3.4 \\
        \addlinespace
        \addlinespace
        & \texttt{wcp} & 2.6 &0.74 & 4.3  \\ 
        \addlinespace
        \addlinespace
        & $\ecp_2$ & 7.6& 23.3 & 15.1\\
        & $\eacp_2$ & 8.7& 16.1& 10.1\\ 
        \hline
        
    \end{tabular}
    % }
\end{table*}


\subsection{Effects of model calibration}

We conduct experiments investigating the effects of model calibration on the robustness of our entropy-based method. We employ temperature scaling with $T<1.0$ to deliberately sharpen the model's softmax outputs, inducing overconfidence and emulating a more poorly calibrated base model. We further compare with a temperature value chosen to improve calibration (on in-distribution data), as measured by the Expected Calibration Error (ECE).
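
For reference, the two quantities used in this experiment can be written as follows. The notation ($z_k$ for the logits, $B_m$ for confidence bins) is introduced only for this illustration, and the number of bins $M$ is left unspecified since it is not stated here. Temperature scaling replaces the softmax output by
\[
    \pi_k(x; T) = \frac{\exp\bigl(z_k(x)/T\bigr)}{\sum_{j=1}^{K} \exp\bigl(z_j(x)/T\bigr)},
\]
so that $T<1$ concentrates mass on the largest logit and $T>1$ flattens the distribution. The ECE over $n$ points partitioned into $M$ confidence bins is
\[
    \mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n}\,\bigl|\mathrm{acc}(B_m) - \mathrm{conf}(B_m)\bigr|,
\]
where $\mathrm{acc}(B_m)$ and $\mathrm{conf}(B_m)$ denote the average accuracy and the average maximum softmax value within bin $B_m$.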

Results for two distribution-shifted datasets can be seen in Table \ref{tab:calibration-results}. Our results suggest that our method is fairly robust to miscalibration: considerably higher ECE (worse calibration) leads to only minor drops in prediction set coverage. This may be due to our use of an entropy quantile, which aggregates uncertainty across the test data, mitigating issues caused by a smaller number of miscalibrated points.



\begin{table}[htbp]
\centering
\caption{Our method is fairly robust to poor model calibration, possibly owing to our use of an entropy quantile.}
\begin{subtable}{\textwidth}
\centering
\caption{ImageNet-V2 coverage results with different temperature scaling.}
\begin{tabular}{lcccc}
\multirow{2}{*}{Method} & $T = 1.15$ & $T = 1.0$ & $T = 0.9$ & $T = 0.8$ \\
        & ECE = 0.02 & ECE = 0.03 & ECE = 0.06 & ECE = 0.08 \\
\hline
\scp & 0.81 & 0.81 & 0.81 & 0.81 \\
\texttt{ETA}     & 0.81 & 0.81 & 0.81 & 0.81 \\
\ecp     & 0.92 & 0.91 & 0.90 & 0.87 \\
\eacp    & 0.92 & 0.91 & 0.90 & 0.88 \\
\end{tabular}
\end{subtable}

\vspace{1em}

\begin{subtable}{\textwidth}
\centering
\caption{ImageNet-R coverage results with different temperature scaling.}
\begin{tabular}{lcccc}
\multirow{2}{*}{Method} & $T = 1.15$ & $T = 1.0$ & $T = 0.9$ & $T = 0.8$ \\
         & ECE = 0.02 & ECE = 0.03 & ECE = 0.06 & ECE = 0.08 \\
\hline
\scp & 0.49 & 0.50 & 0.50 & 0.51 \\
\texttt{ETA}     & 0.61 & 0.62 & 0.62 & 0.63 \\
\ecp     & 0.76 & 0.72 & 0.70 & 0.67 \\
\eacp    & 0.83 & 0.80 & 0.77 & 0.75 \\
\end{tabular}
\end{subtable}
\label{tab:calibration-results}
\end{table}

\end{document}
