
We next provide some background on conformal prediction (\autoref{subsec:background-cp}), group equivariance and invariance properties (\autoref{subsec:background-groups}), and the canonicalization framework (\autoref{subsec:background-canon}). Regarding notation, let $\gX \times \gY$ mark the sample space with some data-generating distribution $P$ over it, and $\rvx, \rvy$ random variables with realizations $\vx, \vy$. We denote any learnable functions, such as a prediction model $f_{\vtheta}: \gX \rightarrow \gY$, as mappings with learnable parameters $\vtheta \in \Theta$.

\subsection{Conformal Prediction}
\label{subsec:background-cp}

We consider the usual setting of \emph{split conformal prediction}\footnote{As opposed to full or cross-validation conformal schemes.}, wherein a hold-out calibration set ${\gD_{cal} = \{(\vx_i, \vy_i)\}_{i=1}^{n}}$ and test set ${\gD_{test} = \{(\vx_j, \vy_j)\}_{j=n+1}^{n+m}}$ are both sampled exchangeably (\ie~permutation invariantly, see \hyperref[def:exch]{Def.~\ref{def:exch}}) from some fixed distribution $P_0$ \citep{papadopoulos2007conformal}. Using a pre-specified scoring function ${s: \gX \times \gY \rightarrow \mathbb{R}}$ and pretrained predictor $f_{\vtheta}$, we compute a set of nonconformity scores $S = \{s_i\}_{i=1}^{n}$ on $\gD_{cal}$, where ${s_i = s(f_{\vtheta}(\vx_i), \vy_i)}$. These scores encode a desired notion of disagreement between predictions and responses, such as the absolute residual $s_i = |f_{\vtheta}(\vx_i) - \vy_i|$ for regression, or one minus the predicted probability of the true class, $s_i = 1 - p(\rvy_i = \vy_i \mid \vx_i)$, for classification. Next, a finite-sample-corrected conformal quantile $Q_{1-\alpha}(F_S)$ is computed, where $F_S$ denotes the empirical distribution over the calibration scores\footnote{Extended with $\{ +\infty \}$ to ensure proper coverage adjustments.}, and $\alpha \in (0,1)$ is a tolerated miscoverage rate. Given a new test sample $(\vx_{n+1}, \vy_{n+1})$, a prediction set is then constructed as ${C(\vx_{n+1}) = \{ \vy \in \gY: s(f_{\vtheta}(\vx_{n+1}), \vy) \leq Q_{1-\alpha}(F_S) \}}$, \ie, we include all candidate responses whose score does not exceed the quantile. Exploiting the data's exchangeability under $P_0$, the prediction set is then guaranteed to contain the true response $\vy_{n+1}$ with probability at least $1-\alpha$, \ie,
\begin{equation}
\label{eq:cp-guarantee}
    \mathbb{P}(\vy_{n+1} \in C(\vx_{n+1})) \geq 1 - \alpha.
\end{equation}
We refer to \cite{shafer2008tutorial, angelopoulos2024theoretical} for details on the intuition and technical proofs.
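
As a concrete illustration of the finite-sample correction, the quantile can be written as an order statistic of the calibration scores. This is a sketch under the common convention for split conformal prediction, and is stated here as an assumption about the exact form of $Q_{1-\alpha}(F_S)$. Writing $s_{(1)} \leq \dots \leq s_{(n)}$ for the sorted calibration scores and setting $s_{(n+1)} = +\infty$, we have
\begin{equation*}
    Q_{1-\alpha}(F_S) = s_{(k)}, \qquad k = \lceil (n+1)(1-\alpha) \rceil.
\end{equation*}
For example, with $n = 99$ and $\alpha = 0.1$ we obtain $k = \lceil 100 \cdot 0.9 \rceil = 90$, \ie~the $90$-th smallest calibration score. In contrast, for $n = 5$ and the same $\alpha$ we obtain $k = \lceil 5.4 \rceil = 6 = n+1$, so the threshold becomes $+\infty$ and the prediction set is trivially the full space $\gY$.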

\paragraph{Mondrian conformal prediction.} The coverage guarantee in \autoref{eq:cp-guarantee} only holds \emph{marginally} over $\gD_{cal} \cup \gD_{test}$, \ie~on average over the draw of calibration and test data, and not necessarily for any particular sub-population. Stronger and more refined guarantees can be obtained by partitioning the data into sub-populations of interest and running the conformal procedure separately per partition. We refer to this as \emph{partition-conditional} or \emph{Mondrian} conformal prediction \citep{toccaceli2019combination}. If we consider a mapping $\phi: \gX \times \gY \rightarrow \{1, \dots, K\}$ assigning each sample to a data partition, the coverage guarantees hold per partition as
\begin{equation}
\label{eq:cp-guarantee-mond}
    \mathbb{P}(\vy_{n+1} \in C(\vx_{n+1}) \,|\, \phi(\vx_{n+1}, \vy_{n+1}) = k) \geq 1 - \alpha
\end{equation}
for all $k \in \{1,\dots,K\}$. Data partitions of interest can include distinction by class label \citep{cauchois2021knowing}, feature properties \citep{m.sesia2021, c.jung2022}, or a balancing criterion like fairness \citep{y.romano2020a}.
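
A minimal sketch of the per-partition calibration may help here. The symbols $\mathcal{I}_k$, $n_k$, $S_k$ and $Q^{(k)}_{1-\alpha}$ are hypothetical notation introduced only for this illustration. Let $\mathcal{I}_k = \{ i \leq n : \phi(\vx_i, \vy_i) = k \}$ collect the calibration indices of partition $k$, with $n_k = |\mathcal{I}_k|$, and compute one quantile per partition as
\begin{equation*}
    Q^{(k)}_{1-\alpha} = Q_{1-\alpha}(F_{S_k}), \qquad S_k = \{ s_i \}_{i \in \mathcal{I}_k}.
\end{equation*}
Each candidate response is then compared against the quantile of the partition it would fall into,
\begin{equation*}
    C(\vx_{n+1}) = \{ \vy \in \gY: s(f_{\vtheta}(\vx_{n+1}), \vy) \leq Q^{(k)}_{1-\alpha} \text{ with } k = \phi(\vx_{n+1}, \vy) \}.
\end{equation*}
Since the finite-sample correction is now driven by $n_k$ rather than $n$, sparsely populated partitions tend to yield more conservative thresholds.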

\paragraph{Weighted conformal prediction.} To enhance data adaptivity and address settings of reduced exchangeability, a weighted formulation of CP is given by replacing the conformal quantile with $Q_{1-\alpha}(\tilde{F}_S)$, where $\tilde{F}_S$ now denotes the empirical distribution over a \emph{weighted} score set as 
\begin{equation}
\label{eq:weight-cp}
    \tilde{F}_S = \sum_{i=1}^{n} \tilde{w}_i \cdot \delta({s_i}) + \tilde{w}_{n+1} \cdot \delta({+\infty}),
\end{equation}
with $\delta({s_i})$ denoting the Dirac delta centered at score $s_i$, and $\tilde{w}_i$ its associated normalized weight such that $\sum_{i=1}^{n+1} \tilde{w}_i = 1$. For example, \cite{barber2023conformal} suggest fixed weighting schemes such as upweighting more recent samples in a data stream setting, while \cite{guan2023localized} propose data-dependent (unnormalized) weights guided by feature distances, such as the kernel distance $w_i = \exp\{-h\,|\vx_i - \vx_{n+1}|\}$.
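
To make the normalization and the resulting quantile explicit, consider the following sketch. It assumes unnormalized weights $w_1, \dots, w_{n+1} \geq 0$ with test weight $w_{n+1} = 1$, which is a convention assumed here for illustration (the kernel weights above satisfy it automatically, since $\exp\{0\} = 1$). Then
\begin{equation*}
    \tilde{w}_i = \frac{w_i}{\sum_{j=1}^{n+1} w_j}, \qquad Q_{1-\alpha}(\tilde{F}_S) = \inf \Big\{ q \in \mathbb{R} : \sum_{i=1}^{n} \tilde{w}_i \, \mathbf{1}[s_i \leq q] \geq 1 - \alpha \Big\},
\end{equation*}
with the convention $\inf \emptyset = +\infty$. For uniform weights $w_i = 1$ we have $\tilde{w}_i = 1/(n+1)$, and the condition reduces to $|\{ i : s_i \leq q \}| \geq (n+1)(1-\alpha)$, which recovers the unweighted conformal quantile.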

\subsection{Group Equivariance and Invariance}
\label{subsec:background-groups}

Formally, we denote a symmetry group $G$ as a set of elements with a binary operator $\,\boldsymbol{\cdot}\,$ satisfying closure and associativity, and for which an identity element $e$ and inverses $g^{-1}$ exist such that $e \cdot g = g$ and $g^{-1} \cdot g = e$ respectively \citep{cohen2016group}. In our context, $G$ can be described as a structured space of possible symmetry transformations on the data. That is, a sample $\rvx \in \gX$ is transformed by a \emph{group action} as $\rho(g) \cdot \rvx$, where $g \in G$ denotes a group element and $\rho: G \rightarrow T$ a group representation mapping $g$ to a concrete transformation\footnote{$T \subset GL(V)$ denotes a subset of the total set of linear invertible transformations on some vector space $V$.}. For instance, if we define $G = SO(2)$ as the group of planar rotations, then $g$ might represent a particular rotation angle, and $\rho(g)$ the rotation of $\rvx$ by that angle via matrix multiplication. Given such geometric data transformations, desirable properties for some predictor $f_{\vtheta}$ can include \emph{(i)} preserving the symmetry structure of $G$ by commuting with group actions, \ie~being \emph{equivariant}, or \emph{(ii)} ensuring robustness to group actions by remaining \emph{invariant} to them. Specifically, $f_{\vtheta}$ is deemed group equivariant if for all $g \in G$ we have that
\begin{equation} 
    f_{\vtheta}(\rho(g) \cdot \rvx) = \rho'(g) \cdot f_{\vtheta}(\rvx),
\label{eq:equivariance}
\end{equation}
where $\rho(g)$ and $\rho'(g)$ act on the data input space $\gX$ and output space $\gY$, respectively. Thus, the model's output transforms predictably under the applied transformation, a property frequently exploited, for instance, in translation-equivariant convolutional models for image processing. In contrast, if $\rho'(g) = \mathbb{I}$ is the identity transformation for every group element $g$, then $f_{\vtheta}$ is invariant to $G$. This property is desirable if input samples $\rvx$ are subject to geometric data transformations or shifts, but $f_{\vtheta}$ should nonetheless provide consistent predictions. In neural network models, both properties are typically achieved by employing architectures that inherently incorporate \autoref{eq:equivariance} as a constraint, or through explicit or implicit learning of symmetries, \eg~via data augmentation (see \autoref{sec:related-work}).
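
As a small worked example, consider $G = SO(2)$ acting on $\gX = \mathbb{R}^2$. The functions $f_1$, $f_2$ and the angle $\omega$ are hypothetical and introduced only for illustration. A group element $g_{\omega}$ is represented by the rotation matrix
\begin{equation*}
    \rho(g_{\omega}) = \begin{pmatrix} \cos\omega & -\sin\omega \\ \sin\omega & \cos\omega \end{pmatrix}.
\end{equation*}
The scaling map $f_1(\rvx) = 2\,\rvx$ satisfies $f_1(\rho(g_{\omega}) \cdot \rvx) = 2\,\rho(g_{\omega}) \cdot \rvx = \rho(g_{\omega}) \cdot f_1(\rvx)$ and is thus equivariant with $\rho' = \rho$. The Euclidean norm $f_2(\rvx) = \|\rvx\|_2$ satisfies $f_2(\rho(g_{\omega}) \cdot \rvx) = \|\rvx\|_2 = f_2(\rvx)$, since rotation matrices are orthogonal, and is thus invariant with $\rho'(g_{\omega}) = \mathbb{I}$.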

\subsection{Equivariance via Canonicalization}
\label{subsec:background-canon}

Instead of designing a model and its layers to be equivariant, one may also obtain equivariance through \emph{canonicalization} \citep{mondal2023equivariant, kaba23equivariance}. At its core, canonicalization aims to learn a mapping from potentially transformed data to its standardized or canonical orientation before processing by the predictor. The approach separates the tasks of correcting and predicting for transformed data, greatly increasing flexibility by allowing the use of \emph{non}-equivariant pretrained predictors within an equivariant framework. More formally, given a predictor $f_{\vtheta}$ we additionally consider a learnable \emph{canonicalization network} (CN) as $c_{\vtheta}: \gX \rightarrow G$, and denote the canonicalization process as 
\begin{equation}
f_{\vtheta}(\rvx) = \rho'(c_{\vtheta}(\rvx)) \cdot f_{\vtheta}\left(\rho(c_{\vtheta}(\rvx)^{-1}) \cdot \rvx \right).
\label{eq:canon-equiv}
\end{equation}
The CN $c_{\vtheta}$ aims to predict the group element whose inverse maps $\rvx$ back to its canonical form, and \autoref{eq:canon-equiv} ensures that the canonicalized predictor on its left-hand side is $G$-equivariant if $c_{\vtheta}$ itself is $G$-equivariant \citep{kaba23equivariance}. Similarly, for invariance we have $\rho'(g) = \mathbb{I}$ and \autoref{eq:canon-equiv} simplifies to
\begin{equation}
f_{\vtheta}(\rvx) = f_{\vtheta}\left(\rho(c_{\vtheta}(\rvx)^{-1}) \cdot \rvx \right) = f_{\vtheta}\left(\hat{g}^{-1} \cdot \rvx \right),
\label{eq:canon-inv}
\end{equation}
where we have omitted $\rho$ since there is no ambiguity on the group action space\footnote{We subsequently abuse notation for simplicity and write $g \cdot \rvx$ for the application of $g$ on the domain directly.}, and $\hat{g}$ denotes the group element predicted by $c_{\vtheta}$. Whereas the original formulation by \cite{kaba23equivariance} directly predicts a single group element ${\hat{g} = c_{\vtheta}(\rvx)}$, \cite{mondal2023equivariant} extend the approach to predict a distribution $\hat{P}_{G \mid \rvx}$ over group elements, from which ${\hat{g} \sim \hat{P}_{G \mid \rvx}}$ can be sampled.
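
To see why equivariance of the CN suffices for \autoref{eq:canon-equiv}, the following short derivation may help. It is a sketch that uses the hypothetical symbol $\tilde{f}$ for the canonicalized predictor, \ie~$\tilde{f}(\rvx) = \rho'(c_{\vtheta}(\rvx)) \cdot f_{\vtheta}(\rho(c_{\vtheta}(\rvx)^{-1}) \cdot \rvx)$. It assumes that $c_{\vtheta}(\rho(g) \cdot \rvx) = g\,c_{\vtheta}(\rvx)$ for all $g \in G$, and that $\rho$ and $\rho'$ are group homomorphisms. Then
\begin{align*}
    \tilde{f}(\rho(g) \cdot \rvx) &= \rho'(g\,c_{\vtheta}(\rvx)) \cdot f_{\vtheta}\left(\rho\big((g\,c_{\vtheta}(\rvx))^{-1}\big)\,\rho(g) \cdot \rvx \right) \\
    &= \rho'(g)\,\rho'(c_{\vtheta}(\rvx)) \cdot f_{\vtheta}\left(\rho\big(c_{\vtheta}(\rvx)^{-1}\,g^{-1}\,g\big) \cdot \rvx \right) \\
    &= \rho'(g) \cdot \tilde{f}(\rvx),
\end{align*}
which holds regardless of whether the underlying predictor $f_{\vtheta}$ is itself equivariant. Setting $\rho'(g) = \mathbb{I}$ yields $\tilde{f}(\rho(g) \cdot \rvx) = \tilde{f}(\rvx)$, \ie~the invariant case of \autoref{eq:canon-inv}.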

\paragraph{Regularization using the canonicalization prior.} There are practical challenges in ensuring that the learning process of the CN is coupled both to the employed predictor and to the correct poses in the data. Thus, \cite{mondal2023equivariant} propose training the CN with a two-term objective of the form ${\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{task}} + \beta \cdot \mathcal{L}_{\text{prior}}}$, where $\mathcal{L}_{\text{task}}$ is a cross-entropy loss term, $\mathcal{L}_{\text{prior}}$ a regularization term, and $\beta$ a weighting factor. In particular, if $f_{\vtheta}$ is pretrained and fully frozen during training, the task loss is zero and an additional learning signal becomes necessary. To this end, the \emph{canonicalization prior} (CP) term $\mathcal{L}_{\text{prior}}$ is introduced to align the CN's learned poses with the canonical pose prevalent in the data $\gD_{can} \sim P_{can}$ used to learn the CN (\eg~a hold-out data split). The loss is then given by
\begin{equation}
    \mathcal{L}_{\text{prior}} = \mathbb{E}_{P_{can}}[D_{KL}(P_{G \mid \rvx} \;||\; \hat{P}_{G \mid \rvx})],
\label{eq:canon-prior}
\end{equation}
where $P_{G \mid \rvx}$ is a prior distribution for the group elements acting on samples in $\gD_{can}$, and $D_{KL}$ the Kullback-Leibler divergence. In practice the prior is usually set to $P_{G \mid \rvx} = \delta(e)$, \ie, full probability mass on the identity element, thus assuming that the `correct' data is subject to no transformations. This additionally simplifies computation of \autoref{eq:canon-prior} for particular groups, \eg~for discrete rotations we obtain $\mathcal{L}_{\text{prior}} = - \mathbb{E}_{P_{can}} [\log \hat{P}_{G \mid \rvx}(e)]$, the expected negative log-probability of the identity element \citep{mondal2023equivariant}. Note that $G$ still needs to be defined beforehand, \ie~the CN learns a distribution \emph{over} group elements, rather than a set of valid group elements themselves (from a possibly infinite space). However, we find that results are not overly impacted when the correct group is a subgroup $G' \subset G$ of the model-specified group (\eg, $C4$ rotations rather than $C8$ rotations), providing some leeway for misspecification (see \autoref{tab:cifar100-robust}).
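
For a finite group, the simplification of \autoref{eq:canon-prior} under the identity prior can be verified directly. The indicator $\delta_e(g)$, equal to $1$ if $g = e$ and $0$ otherwise, is notation introduced here for illustration. Using the convention $0 \log 0 = 0$, we have
\begin{equation*}
    D_{KL}\big(\delta(e) \;||\; \hat{P}_{G \mid \rvx}\big) = \sum_{g \in G} \delta_e(g) \log \frac{\delta_e(g)}{\hat{P}_{G \mid \rvx}(g)} = - \log \hat{P}_{G \mid \rvx}(e).
\end{equation*}
As a hypothetical numerical example, if the CN assigns probability $\hat{P}_{G \mid \rvx}(e) = 0.7$ to the identity for some sample, the per-sample loss is $-\log 0.7 \approx 0.357$ (natural logarithm), and it vanishes only when all probability mass is placed on the identity element.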
