
We next motivate why canonicalization lends itself naturally to joint use with conformal prediction, including a perspective on data exchangeability. In \autoref{subsec:method-usecases} we then outline three ways to leverage the obtained geometric information for conformal procedures under differing shift scenarios.

\paragraph{Practical motivation: flexible and efficient.} Equivariance modelling usually requires custom prediction models which embed the necessary geometric constraints deep within their architecture, such as via group convolutions with regular \citep{cohen2016group,Bekkers2020B-Spline} or steerable filters \citep{weiler2019general}. This introduces additional complexity into the model, complicates training, and can hamper the transferability of a solution across datasets or tasks. In contrast, canonicalization effectively decouples the prediction and equivariance components, permitting the use of a broader variety of non-equivariant, pretrained models for prediction, and ensuring equivariance in a \emph{post-hoc} step. This outsourcing permits the use of more efficient, lightweight equivariant models to learn the canonical mapping in an unsupervised way, while the complex prediction task is handled by a separate, usually substantially larger model (orders of magnitude larger, see \autoref{subsec:exp-robust}). This can also provide benefits over data augmentation, since only a single forward pass through the predictor is necessary. Most crucially, the obtained flexibility meshes particularly well with the conformal prediction framework, as CP's key advantage of \emph{post-hoc} compatibility with arbitrary `black-box' predictors is preserved. In that sense, we may think of canonicalization as a second `bolt-on' module, situated in between the predictor and uncertainty estimation via CP. Naturally, canonicalization has little to no effect on models that are \emph{already} symmetry-aware, as the additional module then becomes redundant.

\paragraph{Theoretical motivation: canonical mapping as data exchangeability.} We may also motivate canonicalization for CP from a more fundamental data perspective. Intuitively, the canonicalization network $c_{\vtheta}$ helps mitigate the predictor's performance loss due to encountered geometric shifts by enforcing data exchangeability with the training set, in turn benefiting uncertainty estimation. More formally, let us first define \emph{data exchangeability} following the CP framework:
\begin{definition}[Exchangeability \citep{shafer2008tutorial}]
    A sequence of random variables $\rvx_1, \dots, \rvx_n$ is exchangeable if for any permutation $\pi: \{1, \dots, n\} \rightarrow \{1, \dots, n\}$ with $n \geq 1$ we have that $P(\rvx_1, \dots, \rvx_n) = P(\rvx_{\pi(1)}, \dots, \rvx_{\pi(n)})$.
\vspace{-5mm}
\label{def:exch}
\end{definition}
That is, the joint data probability is invariant to sample ordering. In particular, observe how the \emph{i.i.d.} setting is a special case where $P(\rvx_1, \dots, \rvx_n)$ factorizes. For conformal coverage guarantees as in \autoref{eq:cp-guarantee} to nominally hold, \hyperref[def:exch]{Def.~\ref{def:exch}} only needs to be satisfied across calibration ($\gD_{cal}$) and test data ($\gD_{test}$), but \emph{not} necessarily for the predictor's training set ($\gD_{train}$). However, learned data properties that translate poorly to new (shifted) samples will result in a low-quality set of computed nonconformity scores, starkly inflating prediction set sizes and rendering the obtained sets uninformative. From a data perspective, this issue can be alleviated if $\gD_{train}$ also approximately satisfies \hyperref[def:exch]{Def.~\ref{def:exch}}, so that $f_{\vtheta}$ produces informative scores.

This is precisely what the CN attempts to ensure via its canonical mapping. Classical exchangeability imposes data invariance under permutations $\pi \in \mathbb{S}_n$, where $\mathbb{S}_n$ represents the set of all permutations of $\{1,\dots,n\}$. Assuming a shift by the group $G$ affecting $\gD_{cal}$, each calibration sample $\rvx_i$ is now also susceptible to an independent transformation ${g_i \in G}$. That is, on a dataset level we now aim for exchangeability (\ie~group invariance) to extend to the group $G^n = G \times G \times \dots \times G$, in which each sample experiences a potentially different transformation of $G$. For every affected sample $g_i \cdot \rvx_i$, the CN ensures the existence of an inverse transform $c_{\vtheta}(g_i \cdot \rvx_i)^{-1}$ which neutralizes $g_i$. That is, proper canonicalization maintains the relationship ${c_{\vtheta}(g \cdot \rvx)^{-1} = c_{\vtheta}(\rvx)^{-1} \cdot g^{-1}}$ for all $g \in G$ and inputs $\rvx$ \citep{kaba23equivariance}. Under the action of $G^n$, we then observe for the joint distribution that
\begin{multline*}
    P(c_{\vtheta}(g_1 \cdot \rvx_1)^{-1} \cdot g_1 \cdot \rvx_1, \,\dots\,, c_{\vtheta}(g_n \cdot \rvx_n)^{-1} \cdot g_n \cdot \rvx_n) \\
    = P(c_{\vtheta}(\rvx_1)^{-1} \cdot \rvx_1, \,\dots\,, c_{\vtheta}(\rvx_n)^{-1} \cdot \rvx_n) \, ,
\end{multline*}
ensuring that the distribution over canonicalized samples remains invariant under $G^n$. This generalizes the classical exchangeability definition of \hyperref[def:exch]{Def.~\ref{def:exch}} to include both dataset permutations and sample-wise transformations, enlarging the symmetry group from $\mathbb{S}_n$ to $\mathbb{S}_n \times G^n$. Since distributional invariance to $G$ implies that transformations from $G$ do not alter the joint distribution, the CN effectively enforces \textit{probabilistic symmetry} \citep[Prop.~1]{bloem2020probabilistic}. Thus, it guarantees well-calibrated nonconformity scores practically useful for CP even under geometric shifts.
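
To make the step behind this identity explicit, the following short derivation may help. It is a sketch that assumes the CN satisfies the relation ${c_{\vtheta}(g \cdot \rvx)^{-1} = c_{\vtheta}(\rvx)^{-1} \cdot g^{-1}}$ exactly, which a learned network may only achieve approximately. For a single sample and any $g \in G$,
\begin{align*}
    c_{\vtheta}(g \cdot \rvx)^{-1} \cdot g \cdot \rvx
    &= c_{\vtheta}(\rvx)^{-1} \cdot g^{-1} \cdot g \cdot \rvx \\
    &= c_{\vtheta}(\rvx)^{-1} \cdot \rvx \, .
\end{align*}
Under this assumption the canonicalized sample does not depend on $g$, and applying the same argument to each of the $n$ samples with its own $g_i$ yields the equality of joint distributions stated above.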


% -------- End of ERIK's version

% ----- NOTES

% \begin{itemize}
%     \item Let's provide another theoretical motivation for the use of CN. Intuitively, CN helps establish data exchangeability to the training data in order to mitigate performance effects of the geometric shift on predictor.
%     \item Consider the exchangeability definition. For conformal guarantees, we only need \autoref{def:exch} to hold for $\gD_{cal}$ and $\gD_{test}$, and do not require it also hold for $\gD_{train}$.
%     \item But, if $f_{\vtheta}$ trained on $\gD_{train}$ is exposed to the shifted data, its learned data properties do not generalize well and so our CP procedure will be very inefficient, and obtained uncertainty will be not very useful
%     \item Mitigating the distribution shift can then be achieved by model fine-tuning or test-time adaptation etc., but from a data perspective we can also alleviate the issue if we can ensure that $\gD_{train}$ is also (approx.) exchangeable, thus aligning predictor's learned data properties
%     \item The CN is doing just that. Let's consider $G_{train} = \{e\}$ and the prior $P_{G \mid \rvx} = \delta(e)$ as (correct) canonical forms, and again omit $\rho$ assuming that $g \cdot \rvx$ is well-defined (\ie~since we focus on $e$ and $\rho(e) \cdot \rvx = \mathbb{I} \cdot \rvx = \rvx$). 
%     \item Let's observe that some train sample $\rvx_{train}$ is subject to the identity element only, so we can trivially write $\rvx_{train} = e \cdot \rvx_{train}$, whereas some calibration sample is acted upon as $g \cdot \rvx_{cal}$.
%     \item Then the CN is attempting to find the inverse element such that $c_{\vtheta}(\rvx_{cal})^{-1} \cdot g = g^{-1} \cdot g = e$, and thus we can observe for the joint distribution of $\rvx_{train}$ and $\rvx_{cal}$ under respective group actions that $P(e\cdot\rvx_{train}, g^{-1} \cdot g \cdot \rvx_{cal}) = P(e \cdot\rvx_{train}, e \cdot \rvx_{cal}) = e \cdot P(\rvx_{train}, \rvx_{cal}) = P(\rvx_{train}, \rvx_{cal})$ is distributionally invariant (Prop.1, \cite{bloem2020probabilistic}). Observe in particular that this holds for actions by any $g \in G$.
%     \item Linking it more directly to exchangeability, consider the finite symmetric group $\mathbb{S}_n$ representing the set of all permutations in $\{1,\dots,n\}$, and a group element $\pi \in \mathbb{S}_n$ as a particular permutation of $\rvx_1, \dots, \rvx_n$. Then distributional invariance (extended to $n$ samples) to any permutation $\pi \in \mathbb{S}_n$ directly equates exchangeability as formulated in \autoref{def:exch}, also referred to as \emph{probabilistic symmetry} by \cite{bloem2020probabilistic}.
%     \item That is, the CN's canonical mapping ensures distributional invariance, which for the particular group $\mathbb{S}_n$ can be interpreted as ensuring data exchangeability.
%     \item This invariance property remains unaffected by a symmetric and point-wise function application, such as the predictor $f_{\vtheta}$. In other words, the predictor's performance is unaffected by sample ordering, and its generalization properties naturally extend to any data exchangeable with its training samples. Thus we motivate how the CN's map to a (predictable) canonical data form ensures data exchangeability amenable to the predictor, effectively mitigating the geometric data shift and thus improving obtainable uncertainty estimates under the CP framework.
%     \item Canon attempts to make data exchangeable by a learnable mapping, whereas something like data augmentation makes approx. exchangeable by expanding the data sequence observed by $f_{\vtheta}$, and inbaked invariance attempts to directly ensure feature encoding or pred output invariance through hard-coded group transformations without addressing it from the input data aspect. \cite{kaba23equivariance} also distinguish these forms as single- or multi-view perspectives.
%     \item \cite{bloem2020probabilistic} further motivate the use of the canonicalization principle as a `representative equivariant' function $\tau(\rvx)$ which maps samples into a representative element (the maximal invariant) of the sample's group \emph{orbit}\footnote{the set of elements to which $\rvx$ can be mapped by the group action}, thus certifying the equivariance properties of canonicalization.
% \end{itemize}

% FROM \cite{kaba23equivariance}:\\
% \cite{bloem2020probabilistic} introduce the concept
% of representative equivariant; the set $\rho(G)\rvx = \{\rho(g)\rvx \mid \forall g \in G\}$ is the orbit of element $\rvx$. It is the set of elements to which $\rvx$ can be mapped by the group action. The set of orbits denoted by $\gX /G$ forms a partition of the set $\gX$. The invariance requirement on a function $\phi$ amounts to having all the members of a group orbit mapped to the same image by $\phi$. It is thus possible to achieve invariance by appropriately mapping all elements to a canonical orbit representative before applying any function. For equivariance, elements can be mapped to a canonical sample and, after a function is applied, transformed back according to their original position in the orbit. 

% The symmetric group Sn over a finite set of n elements contains all the permutations of that set. This group captures the inductive bias that input order should not matter. Sn-equivariant canonicalization functions can be obtained with a direct approach using existing optimal transport solvers ... can also framed as an optimization problem, which makes our optimization approach relevant;

% from \cite{bloem2020probabilistic}:\\
% In Bayesian statistics, the canonical probabilistic symmetry is exchangeability. is exchangeable if its distribution is invariant under all permutations of its elements. 

% the empirical measure is a sufficient statistic for models comprised of exchangeable distributions. 

% invariance equals uniform probability on the orbit; canonicalizaton equals 'representative equivariant' $\tau(x)$ which maps into the maximal invariant of the orbit, which is a reprsentative element of the orbit; helps certify equivariance properties of canon

% Exchangeability is distributional invariance under the action of Sn (or other groups defined by composing Sn
% in different ways for other data structures).



\subsection{Use Cases for Conformal Prediction}
\label{subsec:method-usecases}

Following our motivation, we now illustrate three ways in which the obtained group information can be leveraged to benefit different conformal prediction procedures and tasks.

\input{fig/fig_cond_group_dist}

\paragraph{For general robustness to geometric data shifts.} We first directly demonstrate the obtained robustness to a geometric data shift at calibration and test time. To that end, we can simply combine the CN $c_{\vtheta}$ with a non-equivariant, pretrained predictor and apply standard split conformal prediction (SCP). Since the CN ensures the necessary exchangeable mapping to align the predictor's outputs with the conformal procedure, we expect a substantial improvement in prediction set sizes over directly using $f_{\vtheta}$ and SCP without canonicalization. 

\paragraph{As a diagnostics tool and proxy for conditional coverage.} Unlike inherently equivariant models or models trained with data augmentation, canonicalization provides us with explicit access to \emph{sample-wise} geometric information or pose via the group distributions $\hat{P}_{G \mid \rvx}$. These can be exploited to construct empirical group distributions pertaining to any separable data partition of interest, \eg~by class labels or feature properties. Such empirical group distributions can provide insights into the geometric poses under which a certain property or partition naturally occurs in the data (see \autoref{app:details-mcp} for further intuition). If a partition's `group map'---visualized for some examined partitions in \autoref{fig:class_conditional_mondrian}---reveals informative geometric patterns, the group assignments can be subsequently leveraged to provide stronger partition-conditional or Mondrian conformal prediction (MCP) guarantees (\autoref{eq:cp-guarantee-mond}). In that sense, the group information can be leveraged as a data diagnostics tool to uncover even \emph{a priori} unknown but geometrically informative partitions, or to suggest data exchangeability for a partition when no meaningful group pattern emerges. In principle, such group maps could be extended as far as incorporating multiple datasets to potentially uncover geometric shifts across new data sources.

Formally, given some data partition $k \in \{1, \dots, K\}$ of $\gD_{cal}$ into $K$ parts, an empirical group distribution for the $k$-th partition can be constructed as $\hat{P}_{G \mid k} = \{ \hat{P}_{g \mid k} \mid g \in G\}$, where $\hat{P}_{g \mid k}$ denotes the $g$-th element's estimated frequency computed as
\begin{align}
    \hat{P}_{g \mid k} &= \frac{\sum_{i=1}^{n}\mathbbm{1}(\hat{g}_i = g \,\wedge\, \phi(\vx_i, \vy_i) = k) }{\sum_{i=1}^{n}\mathbbm{1}(\phi(\vx_i, \vy_i) = k) }.
    \label{eq:conditional_distribution}
\end{align}
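The indicator function is given by $\mathbbm{1}(\cdot)$, $\phi(\vx_i, \vy_i) \in \{1, \dots, K\}$ denotes the partition to which $(\vx_i, \vy_i)$ is assigned, and $\hat{g}_i \sim \hat{P}_{G \mid \vx_i}$ is the sampled group element obtained for $(\vx_i, \vy_i)$.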
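
As a purely illustrative example of \autoref{eq:conditional_distribution} with hypothetical values (not taken from our experiments), consider the cyclic group $C4 = \{g_0, g_1, g_2, g_3\}$ of rotations by multiples of $90^{\circ}$, and suppose the $k$-th partition contains five calibration samples whose sampled group elements are $(g_0, g_0, g_1, g_0, g_3)$. The denominator then equals $5$, and counting matches per group element gives
\begin{align*}
    \hat{P}_{g_0 \mid k} = \tfrac{3}{5}, \qquad
    \hat{P}_{g_1 \mid k} = \tfrac{1}{5}, \qquad
    \hat{P}_{g_2 \mid k} = 0, \qquad
    \hat{P}_{g_3 \mid k} = \tfrac{1}{5},
\end{align*}
which sums to one over $G$. In this toy case, the concentration of mass on $g_0$ would indicate that the partition predominantly occurs in a single pose.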

\paragraph{As a weighting scheme for double shift settings.} Consider a more complex \emph{double shift} setting, wherein the first shift between $\gD_{train}$ and $\gD_{cal}$ is addressed by the CN, but an additional second shift between $\gD_{cal}$ and $\gD_{test}$ occurs. For example, the CN trained on $\gD_{cal}$ learns to address a shift caused by the $C8$ rotation group, but test samples are susceptible to continuous rotations on $SO(2)$. In this case even the use of canonicalization with standard SCP can be insufficient to ensure conformal guarantees, since the CN can underperform when faced with new, unknown group elements (\ie~rotation angles in $SO(2)$ but not $C8$). However, the obtained group information can still be leveraged to inform \emph{geometric weights} for a weighted conformal prediction strategy (WCP). We posit that the CN assigns higher probability to group elements that are more closely aligned with the test sample's unknown transformation, and as such provides information to upweight more geometrically relevant calibration samples; we elaborate on this intuition in \autoref{app:details-wcp}. In conjunction with WCP, this may offer improved robustness against shifts with \emph{unknown} group elements. Different double-shift settings and approaches to establish robustness are outlined in \autoref{tab:wcp-settings}.\footnote{We empirically examine the first (\autoref{subsec:exp-robust}) and last rows (\autoref{subsec:exp-weightcp}).}

\input{tab/double_shift_explainer}
\input{tab/robust_shift_cifar100_aps}

Formally, given a test instance $\vx_{n+1}$, the $i$-th calibration sample's geometric relevance with respect to $\vx_{n+1}$ can be measured by ${D \bigl( \hat{P}_{G \mid \vx_{n+1}}, \hat{P}_{G \mid \vx_i} \bigr)}$, with $D$ being any distributional distance metric. Since we desire a small geometric distance between two samples to produce a large importance weight, the unnormalized weight $w_i$ can be defined by an inverse relation of the form $w_i(\vx_{n+1}) = 1/(1 + D^p)$, where $p$ denotes an additional parameter modulating the slope or skewness of the weighting distribution. A final weight $\tilde{w}_i$ is then acquired by subsequent normalization.
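
One possible instantiation of the normalization step, given here only as a sketch, is sum normalization over the $n$ calibration samples (softmax normalization would be an alternative):
\begin{align*}
    \tilde{w}_i(\vx_{n+1}) = \frac{w_i(\vx_{n+1})}{\sum_{j=1}^{n} w_j(\vx_{n+1})} \, .
\end{align*}
As a toy example with hypothetical values, take $n = 3$ calibration samples with distances $D_1 = 0$, $D_2 = 0.5$, $D_3 = 1$ to the test sample (writing $D_i$ for the distance of the $i$-th sample) and $p = 2$. The unnormalized weights are $w_1 = 1$, $w_2 = 1/1.25 = 0.8$ and $w_3 = 1/2 = 0.5$, which sum to $2.3$, so that $\tilde{w}_1 \approx 0.435$, $\tilde{w}_2 \approx 0.348$ and $\tilde{w}_3 \approx 0.217$. The geometrically closest calibration sample thus receives the largest weight, and larger values of $p$ would sharpen this preference for distances below one.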

% Note that we desire the geometric distance between two samples to be \emph{small} for a large weight, so we might want to take the inverse $1/d$ or a related idea. Finally, we need weights to be normalized for their use with WCP, \eg~by using sum normalization $\tilde{w}_i(x_{n+1}) = \frac{w_i(x_{n+1})}{\sum_{j=1}^n w_j(x_{n+1})}$ or softmax normalization.

% , or with $\hat{g} \sim \hat{P}_{G\mid x}$ as $d\bigl(\hat{g}_{n+1}, \hat{P}_{G\mid x_i} \bigl)$ or even $d\bigl(\hat{g}_{n+1}, \hat{g}_{i} \bigl)$ more coarsely