\section{Preliminaries}\label{sec:preliminaries}

Let us consider $x \in \set{X}$ as a multi-variate time series signal where $\set{X} \subseteq \mathbb{R}^{c\times T}$, with $c$ denoting the number of channels and $T$ specifying the signal's window size. 
Here, we are examining a \textit{soft} classifier, denoted as $F \colon \mathbb{R}^{c\times T} \to \prob(\set{Y})$,
where $\prob(\set{Y})$ is the set of probability distributions over $\set{Y}$, and $\set{Y} = \{1, \dots, K\}$ is the set of $K$ classification labels.
Thus, a soft classifier assigns each data point a distribution over classes, rather than just assigning it to a class. 
It is possible to convert any soft classifier $F$ into a hard classifier $f$ by mapping $f(x) \eqdef \argmax_{y\in\set{Y}} F(x)_y$.
In addition, we denote by $I$ the identity matrix, by $\set{N}(0, \sigma^2I)$ the isotropic Gaussian distribution with zero mean and covariance $\sigma^2 I$, and by $\set{U}(a, b)$ the discrete uniform distribution over the integers between $a$ and $b$, where $a,b \in \mathbb{Z}$ and $a<b$.

\subsection{Conformal Prediction (CP)}

Introduced by \cite{vovk1999machine, vovk2005algorithmic} and \cite{papadopoulos2008inductive}, CP offers an intuitive approach to producing prediction sets that achieve a user-defined confidence level. 
In essence, given $n$ training samples ${(x^{(i)}, y^{(i)})}^n_{i=1}$, the objective is to predict the label $y^{(n+1)}$ for a test point $x^{(n+1)}$. 
%
Assuming the training and test samples are exchangeable (a condition that holds, for instance, when they are drawn i.i.d.), CP methods create a prediction set $\set{C}(x^{(n+1)})$ that contains the test label $y^{(n+1)}$ with a specified coverage, such as 90\% or 95\%. 
Formally, this is expressed as: 
\begin{equation}\label{eq:conformal_score}
    \prob[y^{(n+1)} \in \set{C}(x^{(n+1)})] \geq 1 - \alpha,    
\end{equation}
where \(\alpha\) represents the chosen error rate. 
%
Notably, this probability is taken jointly over the training samples and the test point $x^{(n+1)}$; the guarantee is therefore known as
marginal coverage.
CP's core principle involves training a classifier on the dataset and subsequently assigning \textit{non-conformity scores} to held-out calibration data. 
Generally, lower prediction errors correspond to more concise and informative prediction sets.

\paragraph{(Non) Conformity Score}
The process begins by splitting the indices $\{1, \dots, n\}$ of the training data into two disjoint subsets: (i) a proper training set, denoted as $\set{D}_{tr} \subset \{1, \dots, n\}$, and (ii) a calibration set $\set{D}_{cal} \eqdef \{1, \dots, n\} \setminus \set{D}_{tr}$. 
A soft classifier $F$, with outputs $F(x) \in [0, 1]^K$, is then trained on the proper training set to estimate the conditional probabilities $\prob[y \mid x]$ for every $y \in \set{Y}$. 
When using deep network classifiers, the subject of this study, this is typically the outcome of the softmax layer. 
Subsequently, a score function $S:\set{X}\times\set{Y} \to \mathbb{R}_{\geq 0}$ produces a \textit{(non) conformity score}, $S^{(i)} = S (x^{(i)}, y^{(i)})$ for each point in the calibration set. 
This score evaluates the coherence between the model's prediction $F(x)$ and the actual label $y$, where a smaller score denotes a closer match.

\begin{definition}[Conformal Prediction Set]\label{def:prediction_set}
    Given the desired coverage level $1 - \alpha$, a prediction set $\set{C}$ for a new test point $x^{(n+1)}$ is defined as:
    \begin{equation}\label{eq:conformal_set}
        \set{C} = \left\{ y \in \set{Y}: S(x^{(n+1)}, y) \leq Q_{1-\alpha}(\{S^{(i)}\}_{i\in \set{D}_{cal}}) \right\},
    \end{equation}
    where $Q_{1 - \alpha} (\{S^{(i)}\}_{i \in \set{D}_{cal}})$ is defined as the $(1-\alpha)(1+\nicefrac{1}{|\set{D}_{cal}|})$-th empirical quantile of $\{S^{(i)}\}_{i \in \set{D}_{cal}}$.
\end{definition}


In other words, \autoref{eq:conformal_set} scans through all candidate labels $y \in \set{Y}$ and adds to $\set{C}(x^{(n+1)})$ those labels whose scores $S(x^{(n+1)}, y)$ do not exceed the chosen quantile of the calibration scores $\{S(x^{(i)}, y^{(i)})\}_{i \in \set{D}_{cal}}$~\citep{vovk2005algorithmic}. 
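
As a small illustration with hypothetical numbers (not taken from our experiments), suppose $|\set{D}_{cal}| = 99$ and $\alpha = 0.1$. Under the usual finite-sample correction of split conformal prediction, the threshold is the $\lceil (1-\alpha)(|\set{D}_{cal}|+1) \rceil$-th smallest calibration score:
\begin{equation*}
    \lceil (1-\alpha)(|\set{D}_{cal}|+1) \rceil = \lceil 0.9 \cdot 100 \rceil = 90,
\end{equation*}
that is, the 90th smallest of the 99 calibration scores. If this score equals $0.70$ and a test point receives the scores $0.15$, $0.60$ and $0.90$ for three candidate labels, then the first two labels enter the prediction set and the third is excluded.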

\subsection{PAC Prediction Set}\label{sec:pac}

Our goal is to find conformal sets that are not only as compact as possible, but also highly reliable, adhering to the principle of being \textit{probably approximately correct} (PAC)~\citep{valiant1984theory}.
Formally, consider an \textit{algorithm} $\set{A}$ that takes a set of calibration data $\set{D}_{cal} \subset \set{D}_{tr}$ and generates a CP set $\set{C}$. Given $\gamma, \xi \in (0, 1)$, we say that $\set{A}$ is PAC if:
%
\begin{equation}\label{eq:pac_set}
    \prob_{\set{D}_{cal}\sim \set{D}_{tr}} \left[ L_{\set{D}_{cal}}(\set{C}) \leq \xi\;|\; \set{C} = \set{A}(\set{D}_{cal}) \right] \geq 1 - \gamma,
\end{equation}
%
where $L_{\set{D}_{cal}}(\set{C}) = \prob_{(x, y) \sim \set{D}_{cal}}\left[y \notin \set{C}(x)\right]$ is the true error.
The challenge lies in developing an algorithm $\set{A}$ that not only meets the PAC criteria but also constructs confidence sets $\set{C}(x)$ that are, on average, as minimal as possible.
In the context of machine learning, \cite{park2019pac, park2020pac} propose to construct $\set{C}$ by parametrizing it with a scalar $\tau \in \set{T} \subseteq \mathbb{R}_{\geq 0}$ as:
%
\begin{equation}
    \set{C}_{\tau}(x) = \left\{ y\in \set{Y}\,:\, S(x, y) \geq \tau \right\},
\end{equation}
%
where $\tau$ is the threshold that controls the trade-off between set size and expected error: a larger $\tau$ yields smaller sets but a higher error. 
They cast this problem in an \textit{empirical risk minimization} framework, where the objective is to minimize the size of the predicted confidence sets.
In practice, the goal is to find the maximum threshold value $\hat{\tau}$ such that the empirical error $\hat{L}_{\set{D}_{cal}}(\set{C}_\tau) = \sum_{(x,y)\in\set{D}_{cal}} \mathds{1}(y \notin \set{C}_\tau (x))$, i.e., the number of miscovered calibration points,
does not exceed a prescribed bound.
Formally, this is expressed as:
%
\begin{equation}\label{eq:pac_tau}
    \hat{\tau} = \max_{\tau \in \set{T}} \left\{ \tau \,:\, \hat{L}_{\set{D}_{cal}}(\set{C}_\tau) \leq k(m, \xi, \gamma) \right\},
\end{equation}
%
where $m = |\set{D}_{cal}|$ and the right-hand side of the inequality is the largest admissible error count $k$, derived from the Binomial distribution as follows:
%
\begin{equation}\label{eq:pac_alpha}
    k(m, \xi, \gamma) = \max_{k \in \mathbb{N}_{0}} \left\{ k : \sum_{i = 0}^{k} \binom{m}{i} \xi^i (1 - \xi)^{m-i} < \gamma \right\}.
\end{equation}
%
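This bound rests on the observation that, for a fixed prediction set and i.i.d. calibration points, the number of miscovered points follows a Binomial distribution whose success probability is the true error.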
By setting $\hat{\tau}$ in this manner, we aim to minimize the size of the confidence sets while ensuring that the empirical error stays within acceptable probabilistic bounds, thereby adhering to the PAC guarantee of \autoref{eq:pac_set}.
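
To make the two steps concrete, consider a hypothetical setting with $m = 100$ calibration points, $\xi = 0.1$ and $\gamma = 0.05$ (values chosen for illustration only). Writing $B(k)$, a shorthand introduced here, for the cumulative distribution function of a Binomial distribution with $m$ trials and success probability $\xi$, we obtain
\begin{equation*}
    B(4) = \sum_{i=0}^{4} \binom{100}{i}\, 0.1^{i}\, 0.9^{100-i} \approx 0.024 < 0.05,
    \qquad
    B(5) \approx 0.058 \geq 0.05,
\end{equation*}
so that $k(100, 0.1, 0.05) = 4$. Since a calibration point $(x, y)$ is miscovered by $\set{C}_\tau$ exactly when $S(x, y) < \tau$, the largest threshold with at most four miscovered points is the fifth smallest true-label score in the calibration set, provided that $\set{T}$ contains this value.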


\subsection{Smoothed Conformal Prediction}

Initially introduced by \cite{cohen2019certified} and \cite{salman2019provably}, randomized smoothing computes $\ell_2$-norm certificates around an input sample $x$ by estimating which class is most likely to be returned when $x$ is perturbed by isotropic Gaussian noise.
\begin{definition}[Smooth Classifier]\label{def:smooth_classifier}
    Given a \textit{soft} classifier $F$, randomized smoothing considers a \textit{smooth} version of $F$ defined as:
    \begin{equation}\label{eq:smooth_classifier}
        G(x) \eqdef \E_{\delta \sim \set{N}(0, \sigma^2I)}\left[F(x + \delta)\right],
    \end{equation}
    where $\sigma > 0$ represents the standard deviation.
\end{definition}
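
The expectation in \autoref{eq:smooth_classifier} generally has no closed form for deep networks and is commonly approximated by Monte Carlo sampling. A minimal sketch, where the sample size $N$ and the estimator $\hat{G}$ are notation introduced here for illustration:
\begin{equation*}
    \hat{G}(x) = \frac{1}{N} \sum_{j=1}^{N} F(x + \delta_j),
    \qquad \delta_1, \dots, \delta_N \overset{\text{i.i.d.}}{\sim} \set{N}(0, \sigma^2 I).
\end{equation*}
Each $F(x + \delta_j)$ is a probability distribution over $\set{Y}$, hence so is their average $\hat{G}(x)$.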

\citet{cohen2019certified} demonstrated that $G$ is robust to $\ell_2$ perturbations of radius $R$, where $R$ grows with the gap between the most likely and the second most likely class: $R = \frac{\sigma}{2}\left(\Phi^{-1}(p_A) - \Phi^{-1}(p_B)\right)$, with $p_A$ and $p_B$ the probabilities of these two classes under the Gaussian noise and $\Phi^{-1}$ the inverse of the standard normal cumulative distribution function.
Unlike other formal verification methods, randomized smoothing does not make any assumptions regarding the model's properties, allowing certification to be scaled to larger and more complex networks.

\paragraph{Smoothed Score}
Interestingly, the inherent robustness offered by randomized smoothing has also been used as an additional layer to address challenges in conformal prediction~\citep{gendler2021adversarially, pmlr-v216-ghosh23a}.
Sets formed by the basic conformal method may not ensure accurate coverage, especially when real-world data violates the exchangeability assumption due to frequent distribution shifts~\citep{tibshirani2019conformal, cauchois2020robust, gibbs2021adaptive}.
In recent work, \cite{gendler2021adversarially} introduced a \textit{smooth} version of the original non-conformity score, obtained by averaging $S (x + \delta, y)$ over many independent noise samples $\delta$.

\begin{definition}[Smooth Score]
Let $S:\set{X}\times\set{Y} \to \mathbb{R}_{\geq 0}$ be a score function taking values in $[0, 1]$, so that $\Phi^{-1}$ below is well defined. We define the smoothed score function as:
\begin{equation}
    \tilde{S}(x, y) \eqdef \Phi^{-1} \left( \E_{\delta \sim \set{N}(0, \sigma^2I)}\left[S(x + \delta, y)\right] \right),
\end{equation}
where $\Phi^{-1}$ is the inverse of the cumulative distribution function (i.e., the quantile function) of the standard normal distribution.
\end{definition}
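
To give a feel for the scale of $\tilde{S}$, consider three hypothetical values of the averaged score $\E_{\delta}\left[S(x + \delta, y)\right]$ (illustrative numbers only):
\begin{equation*}
    \Phi^{-1}(0.1587) \approx -1, \qquad \Phi^{-1}(0.5) = 0, \qquad \Phi^{-1}(0.8413) \approx 1.
\end{equation*}
The smoothed score is therefore unbounded and can be negative, even though the averaged score itself lies in $(0, 1)$; since $\Phi^{-1}$ is strictly increasing, the ordering of the scores is preserved.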

As shown in \cite{salman2019provably} and \cite{gendler2021adversarially}, the local Lipschitz continuity induced by randomized smoothing yields an upper bound on the smoothed score of a perturbed input $\tilde{x}^{(n+1)}$:
\begin{equation}\label{eq:smooth_coverage}
    \tilde{S}(\tilde{x}^{(n+1)}, y) \leq \tilde{S}(x^{(n+1)}, y) + \frac{R_\delta}{\sigma},
\end{equation}
which holds for every $y\in\set{Y}$.
If we consider a distance metric $d$ between a perturbed input $\tilde{x}$ and $x$ such that $d(\tilde{x}, x) \leq \epsilon$, with $\epsilon > 0$, then for Gaussian noise $\delta \sim \set{N}(0, \sigma^2I)$ the radius $R_\delta$ corresponds to $\norm{\epsilon}_2$.
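
As a worked instance of \autoref{eq:smooth_coverage} with hypothetical values, take $R_\delta = 0.125$ and $\sigma = 0.25$, and suppose the averaged score of the clean input is $0.5$, so that $\tilde{S}(x^{(n+1)}, y) = \Phi^{-1}(0.5) = 0$. The bound then gives
\begin{equation*}
    \tilde{S}(\tilde{x}^{(n+1)}, y) \leq 0 + \frac{0.125}{0.25} = 0.5,
\end{equation*}
which, mapped back through $\Phi$, means that the averaged score of the perturbed input is at most $\Phi(0.5) \approx 0.691$. Doubling $\sigma$ halves the additive term, at the cost of evaluating the score under stronger noise.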