
We next empirically validate our three different approaches to integrating geometric information with conformal procedures. We briefly outline our experiment design, with further details and results in the Appendix. Our code is publicly available at \url{https://github.com/computri/geometric_cp}.

\paragraph{Experimental design.} As outlined in \autoref{subsec:background-canon}, the canonicalizer is usually trained using a joint task and prior regularization loss ${\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{task}} + \beta \cdot \mathcal{L}_{\text{prior}}}$. In keeping with conformal prediction, we desire a fully \emph{post-hoc} approach amenable to pretrained predictors, and therefore use canonicalizers trained exclusively with the canonicalization prior via \autoref{eq:canon-prior}. Consequently, the predictor in our experiments is \emph{pretrained and frozen} when used in conjunction with the CN, whereas the relevant data augmentation and equivariant baselines require $\hat{f}_\theta$ to be trained from scratch. We emphasize that our approach uses the exact same prediction model $\hat{f}_\theta$, pretrained without augmentations.

For all classification tasks, we employ the popular \textit{Adaptive Prediction Sets} (APS) \citep{romano2020classificationvalidadaptivecoverage} as our default nonconformity score for every conformal procedure, and report results for an alternative scoring method by \citet{sadinle2019least} in \autoref{app:exp}. Following standard practice, we report \emph{empirical coverage} and \emph{mean set size} as our metrics to assess the quality of uncertainty estimates \citep{shafer2008tutorial, angelopoulos2024theoretical}. Empirical coverage assesses the \emph{validity} of our guarantees by comparison with the target coverage level $(1-\alpha)$, whereas prediction set size assesses the \emph{efficiency} of the method, with smaller sets being more informative. We dub the proposed approach CP$^2$, for the combined use of the canonicalization prior (CP) with conformal prediction.
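
As a minimal sketch of these two metrics, assume a test set $\gD_{test}$ of labelled pairs $(x_i, y_i)$ and write $\hat{C}(x_i)$ for the prediction set returned for $x_i$ (the symbols $\hat{C}$, $\text{Cov}$, and $\text{Size}$ are notation assumed here for illustration). The reported quantities can then be written as
\[
\text{Cov} = \frac{1}{|\gD_{test}|} \sum_{(x_i, y_i) \in \gD_{test}} \mathbf{1}\!\left[ y_i \in \hat{C}(x_i) \right],
\qquad
\text{Size} = \frac{1}{|\gD_{test}|} \sum_{(x_i, y_i) \in \gD_{test}} \big| \hat{C}(x_i) \big| .
\]
A valid procedure yields $\text{Cov}$ at or above $1-\alpha$ up to finite-sample fluctuations, and among valid procedures a smaller $\text{Size}$ is preferred.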


\subsection{Robustness to Geometric Data Shifts}
\label{subsec:exp-robust}

We assess the method's robustness to geometric shifts induced by three rotation groups ($C4$, $C8$, and $SO(3)$), across three datasets (CIFAR-10, CIFAR-100, and ModelNet-40) and two data modalities (images and point clouds).

\input{tab/robust_shift_pointcloud_aps}

\paragraph{Image classification.} We evaluate two ResNet-50 predictors on CIFAR-10 and CIFAR-100 samples subjected to $C4$ and $C8$ rotation shifts. These groups are discrete subgroups of $SO(2)$ with four and eight equidistant elements, respectively. Three model training configurations are considered: \emph{(i)} the prediction models trained in a default, non-augmented manner; \emph{(ii)} the same predictors trained with relevant data augmentations to obtain approximate invariance; and \emph{(iii)} pretrained and frozen predictors with `bolt-on' canonicalization models trained for $|G|=4$ and $|G|=8$ group elements. Each configuration is subsequently combined with standard SCP to provide prediction sets with a target coverage rate of $(1-\alpha)=95\%$.
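
For reference, the split conformal step shared by all three configurations can be sketched as follows, assuming $n$ calibration scores $s_i = s(x_i, y_i)$ computed with the chosen nonconformity score $s$ (the symbols $s_i$, $n$, $k$, $\hat{q}$, and $\hat{C}$ are notation assumed here for illustration):
\[
k = \lceil (n+1)(1-\alpha) \rceil,
\qquad
\hat{q} = s_{(k)},
\qquad
\hat{C}(x) = \left\{ y : s(x, y) \leq \hat{q} \right\},
\]
where $s_{(k)}$ denotes the $k$-th smallest calibration score and $k \leq n$ is assumed. As a purely hypothetical example, with $1-\alpha = 0.95$ and $n = 1000$ calibration samples we obtain $k = \lceil 950.95 \rceil = 951$, so $\hat{q}$ is the 951st smallest score. Under exchangeability of calibration and test samples, the resulting sets contain the true label with probability at least $1-\alpha$.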

Classification accuracy and conformal results for CIFAR-100 are given in \autoref{tab:cifar100-robust} (see \autoref{tab:cifar10-robust} for CIFAR-10). For non-shifted data, performance remains comparable across configurations. While the base predictor exhibits the highest accuracy in that setting, it fails to generalize under geometric shift, as reflected by its poor performance and uninformative set sizes. In contrast, both the data-augmented and canonicalization approaches ensure robustness to the shift while achieving accuracy similar to that in the non-shifted setting, unless the learned group is misspecified (\ie, trained for $C4$ but exposed to $C8$). We observe that in the inverse case robustness continues to hold, which suggests favouring a broader group definition when faced with the risk of unknown group elements. That is, ideally the learned group is chosen to be maximal within constraints on computational resources and accuracy requirements, since a coarser discretization will induce more discretization artifacts. Overall, our results highlight canonicalization as a lightweight alternative to ensure robustness without necessitating retraining.
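
One way to read the asymmetry between the two misspecification cases is through the subgroup structure of the rotation groups. Writing each group as its set of planar rotation angles,
\[
C4 = \{0^\circ, 90^\circ, 180^\circ, 270^\circ\}
\;\subset\;
C8 = \{k \cdot 45^\circ : k = 0, \dots, 7\},
\]
every $C4$ shift is also an element of $C8$ and is therefore within reach of a canonicalizer trained for $C8$, whereas the odd multiples of $45^\circ$ in $C8$ have no counterpart in $C4$.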

\paragraph{Point cloud classification.} Unlike 2D images, point clouds exist within a continuous 3D space where rotational shifts are more intrinsic. We evaluate the performance of the popular point cloud classifiers PointNet \citep{qi2016pointnet} and DGCNN \citep{wang2018dgcnn} with and without canonicalization, along with Rapidash \citep{vadgama2025utilityequivariancesymmetrybreaking}, a recent proposal which permits adjustable levels of equivariance---from non-equivariant to fully equivariant. Our results in \autoref{tab:pointcloud-robust} echo those from the image domain, revealing that unadjusted base models fail to maintain robustness against orientation shifts in point clouds, resulting in inflated conformal metrics. Conversely, models equipped with data augmentation or equivariance properties demonstrate better resilience to these geometric shifts. This includes canonicalization, which in this instance trains a network that is \emph{multiple orders of magnitude} smaller than those of the other approaches (see \autoref{app:details-robust} for architecture details). In addition, data augmentations become substantially more expensive to incorporate due to the many degrees of freedom of 3D spatial rotations.

\input{fig/fig_mcp_cov_one}


\subsection{Diagnostics for Conditional Coverage}
\label{subsec:exp-condcover}

Next, we leverage the geometric information obtained from the canonicalization network's sample-wise group distributions to construct partition-conditional group distributions $\hat{P}_{G|k}$ following \autoref{eq:conditional_distribution}, and visualize the obtained `group maps' for CIFAR-10 in \autoref{fig:class_conditional_mondrian}. In each column, we display the true group distribution $P_{G|k}$---tractable by manually inducing different partition-conditional shifts---and the distribution $\hat{P}_{G|k}$ recovered using the CN. We find that the model effectively uncovers meaningful geometric patterns when particular shifts are imposed on the data. We also visualize the class-conditional group map on the data \emph{without} any geometric shift (\autoref{fig:class_conditional_mondrian}, first column) and observe that samples across all classes are predominantly mapped to the identity element, \ie~upright. We can interpret the approach as a visual test for exchangeability, assessing whether all bucketed samples across the partition adhere to the same geometric properties (as is the case here).

\input{fig/fig_wcp}

Additionally, we may determine that particular partitions correlate with particular group elements, and in such cases leverage sample assignments to each entry $\hat{P}_{g \mid k}$ as an unsupervised proxy for Mondrian conformal prediction (MCP). While we conformalize directly for the group partition (see \autoref{fig:app-partition_coverage}), the captured geometric relationship will \emph{by proxy} lead to more balanced coverage for the associated data partition. We demonstrate this for the class-conditional case in \autoref{fig:partition_coverage}, where MCP applied to the $C8$ group elements---which exhibit a correspondence with CIFAR-10 class labels---substantially improves per-class coverage over SCP due to a better-tailored conformal quantile estimate. Naturally, the proxy relationship is limited by the extent to which a partition-conditional group pattern intersects with multiple partitions, and at the other extreme no improvements are obtained when partitions exhibit identical group distributions (\autoref{fig:target_partition_coverage}, left).
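
As a sketch of the group-wise calibration step in Mondrian conformal prediction, assume each sample is assigned to a single group element $\hat{g}_i \in G$, for instance the most likely element under the canonicalization network's sample-wise group distribution (this hard-assignment rule, and the symbols $n_g$, $k_g$, and $\hat{q}_g$, are assumptions made here for illustration). One quantile is then computed per group element,
\[
k_g = \lceil (n_g + 1)(1-\alpha) \rceil,
\qquad
\hat{q}_g = s^{g}_{(k_g)},
\qquad
\hat{C}(x) = \left\{ y : s(x, y) \leq \hat{q}_{\hat{g}(x)} \right\},
\]
where $n_g$ is the number of calibration samples assigned to $g$, $s^{g}_{(k)}$ is the $k$-th smallest nonconformity score among them, and $\hat{g}(x)$ is the group element assigned to the test sample. Each group element thus receives its own threshold, which is what allows coverage to balance across partitions that correlate with group elements.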

\subsection{Weighting for Double Shift Settings}
\label{subsec:exp-weightcp}

Finally, we evaluate the double shift setting described in \autoref{subsec:method-usecases} and \autoref{tab:wcp-settings} (third row). \autoref{fig:weighting} depicts settings in which $\gD_{cal}$ is subject to discrete $C4$ and $C8$ rotation shifts, while $\gD_{test}$ undergoes a gradual interpolation towards continuous $SO(2)$ shifts (obtained by adjusting data sampling probabilities). Thus, the secondary test shift ranges from benign (\ie~no new group elements) to severe, where the uniform $SO(2)$ group places substantial probability mass on previously unknown rotations. We employ our geometric weighting scheme in conjunction with weighted conformal prediction, and compare against the standard variant. We clearly observe a coverage breakdown as the geometric difference between data partitions grows, since the necessary exchangeability condition between $\gD_{cal}$ and $\gD_{test}$ is violated. Yet, geometric weighting can help delay the effect at the cost of enlarged set sizes, suggesting that partial group knowledge can be beneficial to robustness even under \emph{unknown} group actions. However, improvements remain bottlenecked by the static training performance of the canonicalizer, and a more practical deployment should consider an updating step to incorporate new geometric information upon arrival.
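
For context, the generic form of the weighted conformal quantile can be sketched as follows. The weights are left abstract: $w_1, \dots, w_n \geq 0$ for the calibration samples and $w_{n+1} \geq 0$ for the test sample are notation assumed here for illustration, and a likelihood ratio between test and calibration group distributions would be one natural, but assumed, choice. With normalized weights $p_i = w_i / \sum_{j=1}^{n+1} w_j$, the threshold is
\[
\hat{q}(x) = \inf \left\{ q : \sum_{i=1}^{n} p_i \, \mathbf{1}\!\left[ s_i \leq q \right] \geq 1-\alpha \right\},
\]
with the remaining mass $p_{n+1}$ placed at $+\infty$, so that $\hat{q}(x) = +\infty$ (and all labels are included) whenever the calibration mass cannot reach $1-\alpha$. For uniform weights, $p_i = 1/(n+1)$ and the condition reduces to requiring at least $\lceil (n+1)(1-\alpha) \rceil$ calibration scores below $q$, which recovers the standard split conformal quantile. Down-weighting calibration samples whose group elements are rare at test time shifts mass towards $+\infty$, which is consistent with the enlarged set sizes observed above.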
