\section{Experiments}\label{sec:experiments}

In our investigation of time series classification, we compare the generalized smoothed classifier, as defined in \autoref{sec:temporal_smooth_classifier}, with the \textit{vanilla} classifier. 
We dub the classifier \textit{Temporal Smooth Conformal Predictor} (TSCP), since our main goal is to understand how smoothing the input with perturbations native to time series (see \autoref{app:perturbations}) affects the coverage and accuracy of CP.
In the first experiment, we examine adversarial attacks to measure the robustness of TSCP against intentionally crafted disturbances.
Next, we explore domain generalization to assess how well the models adapt to different operational settings. 
These experiments are structured to provide a clearer understanding of the relative advantages of TSCP in managing the dynamic and noisy conditions typical of vehicle operation, and to identify a suitable architecture for time series classification.

\subsection{Settings}\label{exp:settings}

In our analysis, we consider two data sources: the UCR time series classification archive~\cite{UCRArchive}, composed of $128$ time series datasets, and an in-house dataset composed of $7$ input signals recorded by vehicle sensors. For the in-house dataset, after an initial binning, a single data point contains $500$ time steps, resulting in a data snippet $x\in\R^{7 \times 500}$, which is classified into one of two classes $y\in\{0,1\}$.
As classifiers, we employ two distinct neural network architectures: a convolutional neural network (CNN) for the UCR datasets and a time series transformer for the in-house dataset. 
Additional details are reported in \autoref{app:settings}.

\paragraph{Coverage}

In the context of conformal prediction, coverage is a pivotal metric that measures the reliability of the model's prediction sets. 
It is the proportion of test points whose true label falls within the prediction set generated by the model. 
Formally, the coverage can be expressed as 
\begin{equation}
    \mathrm{Coverage} = \frac{1}{|\set{D}_{\mathrm{test}}|} \sum_{(x, y)\in\set{D}_{\mathrm{test}}} \mathds{1}(y \in \set{C}(x)), 
\end{equation}
where $\set{C}(x)$ denotes the conformal prediction set generated by the model and $\mathds{1}$ is the indicator function, which equals 1 if the condition $y \in \set{C}(x)$ holds and 0 otherwise.
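
The set-size reported in the following experiments can be read analogously as the average cardinality of the prediction sets. The expression below is a sketch under that assumption; it follows the usual convention but is not spelled out in the text above:
\begin{equation}
    \mathrm{Set\mbox{-}Size} = \frac{1}{|\set{D}_{\mathrm{test}}|} \sum_{(x, y)\in\set{D}_{\mathrm{test}}} |\set{C}(x)|.
\end{equation}
As an illustration with hypothetical numbers, consider a test set of 10 points in which 9 prediction sets contain the true label, 8 sets contain one label, and 2 sets contain two labels. The coverage is then $9/10 = 0.9$ and the set-size is $(8 \cdot 1 + 2 \cdot 2)/10 = 1.2$. Under this reading, an average set-size below 1 implies that some prediction sets are empty.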


\paragraph{Hardware Resources}
The experiments were conducted utilizing a server with four NVIDIA A100 GPUs and an AMD EPYC 7542 32-Core CPU.



\subsection{Adversarial Robustness on UCR}\label{sec:adversarial_robustness}


In this section, we explore the effectiveness of different classifiers in a white-box setting, focusing on their susceptibility to evasion attacks. 
Using Projected Gradient Descent (PGD)~\cite{carlini2017towards}, we assess the robustness of three prediction techniques: CP~\cite{vovk2005algorithmic}, RSCP~\cite{gendler2021adversarially}, and our method (TSCP). 
Our analysis delves into each method's accuracy, coverage, and prediction set-size under varying levels of adversarial perturbations.
We perform 20 PGD~\citep{carlini2017towards} attacks with budgets $\epsilon$ uniformly distributed in $[0, 0.1]$, each with 40 iterations and a step size of $\epsilon\times 10^{-1}$.
For RSCP and TSCP, we consider 2000 samples.
The primary objective is to shed light on how standard and smooth classification approaches behave when faced with increasingly intense adversarial samples, thereby evaluating their overall robustness.
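
For reference, the $\ell_\infty$ PGD iteration is commonly written as follows. This is a sketch of the standard formulation, assuming the sign-gradient variant; the loss $\mathcal{L}$, the classifier $f$, the step size symbol $\eta$ and the iteration count $T$ are notation introduced here for illustration:
\begin{equation}
    x^{(t+1)} = \Pi_{\{x' \,:\, \|x' - x\|_\infty \le \epsilon\}} \left( x^{(t)} + \eta \, \mathrm{sign}\left( \nabla_{x} \mathcal{L}\big(f(x^{(t)}), y\big) \right) \right), \qquad t = 0, \dots, T-1,
\end{equation}
where $\Pi$ denotes the projection onto the $\ell_\infty$ ball of radius $\epsilon$ around the clean input $x$, $T = 40$ and $\eta = \epsilon \times 10^{-1}$. For the largest budget $\epsilon = 0.1$, this gives $\eta = 0.01$ and an accumulated step length of $T\eta = 0.4 = 4\epsilon$, so the projection is what keeps the iterates within the budget.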

\begin{figure}
    \centering
    \includegraphics[width=0.45\textwidth]{figs/adversarial/Earthquakes_random_0.2.pdf}
    \vspace{-1em}
    \caption{Top-1 accuracy, coverage and set-size comparison between CP~\citep{vovk2005algorithmic}, RSCP~\citep{gendler2021adversarially} and TSCP (ours) under increasing $\ell_\infty$-norm adversarial perturbations with budget $\epsilon$ for the Earthquakes UCR dataset~\citep{UCRArchive}.
    TSCP and RSCP have $\sigma = 0.2$.}
    \vspace{-1em}
    \label{fig:adversarial_attack}
\end{figure}

\autoref{fig:adversarial_attack} presents the top-1 accuracy, coverage and set-size of CP, RSCP, and TSCP under escalating $\ell_\infty$-norm perturbations with budget $\epsilon$ for one of the UCR datasets~\citep{UCRArchive}.
We observe that RSCP and TSCP maintain higher coverage than CP for increasing $\epsilon$ values, despite having different accuracy.
This highlights the robustness of a smooth classifier against adversarial perturbations.
However, the set-size increases slightly with increasing $\epsilon$, meaning that the prediction sets become less informative.
Overall, TSCP tends to outperform the other methods in accuracy and coverage, but all methods show a degree of decline in performance with increasing adversarial budget.


\begin{table}[htb]
     \centering
     \caption{Comparison of CP~\citep{vovk2005algorithmic}, RSCP~\citep{gendler2021adversarially} and TSCP (ours) across UCR~\citep{UCRArchive} datasets. 
     We report the average over 20 PGD~\citep{carlini2017towards} attacks with budgets $\epsilon$ uniformly distributed in $[0, 0.1]$ and a target coverage of 90\% ($\alpha = 0.1$).
     RSCP and TSCP are augmented by $\sigma = 0.2$.
     The complete version is available in \autoref{app:adversarial_attack}.
     }
     \label{tab:ucr_adversarial}
     \vspace{-0.5em}
     \adjustbox{width=0.45\textwidth}{%
     \begin{tabular}{lrrrrrrr}
         \toprule
         \textbf{Dataset} &\multicolumn{3}{c}{\textbf{Coverage}} &\multicolumn{3}{c}{\textbf{Set-Size}} \\
                 &CP &RSCP &TSCP &CP &RSCP &TSCP \\
        \midrule
        \midrule
         ArrowHead &78.7 &97.5 &99.4 &0.83 &2.41 &2.53 \\
         BME &100.0 &100.0 &100.0 &0.47 &3.00 &3.00 \\
         Beef &71.6 &100.0 &100.0 &0.40 &4.57 &4.57 \\
         BirdChicken &78.2 &100.0 &100.0 &0.74 &2.00 &2.00 \\
         CBF &97.2 &100.0 &100.0 &2.14 &3.00 &3.00 \\
         Car &78.4 &100.0 &100.0 &0.95 &4.00 &4.00 \\
         Chinatown &92.2 &100.0 &100.0 &0.69 &2.00 &2.00 \\
         CinC-ECG &88.6 &99.9 &98.2 &0.52 &3.98 &3.84 \\
         Coffee &56.4 &99.1 &53.9 &0.61 &1.94 &0.99 \\
         Cricket-X &85.4 &99.7 &100.0 &1.22 &10.54 &10.19 \\
         Cricket-Z &85.0 &99.4 &99.7 &1.94 &10.20 &9.70 \\
         Diatom Red. &75.0 &99.9 &99.3 &0.69 &3.99 &3.94 \\
         Distal Age &92.1 &100.0 &100.0 &1.57 &2.93 &2.96 \\
         Distal Correct &93.3 &99.8 &100.0 &1.72 &1.97 &1.97 \\
         Distal TW &90.6 &100.0 &100.0 &2.17 &4.48 &4.35 \\
            &\vdots &\vdots &\vdots &\vdots &\vdots &\vdots \\
         Toe Seg. 1 &79.2 &99.4 &95.4 &0.89 &1.81 &1.64 \\
         Toe Seg. 2 &93.9 &98.7 &95.6 &0.66 &1.45 &1.38 \\
         Trace &90.1 &100.0 &100.0 &0.68 &3.14 &3.18 \\
         TwoLeadECG &75.0 &95.4 &99.7 &0.44 &1.81 &1.92 \\
         Two-Patterns &100.0 &100.0 &100.0 &0.78 &3.80 &3.77 \\
         UMD &99.7 &100.0 &100.0 &1.71 &3.00 &3.00 \\
         UWave All &95.8 &99.9 &100.0 &0.80 &6.91 &7.37 \\
         Synt. Control &99.4 &100.0 &100.0 &0.90 &2.84 &2.78 \\
         uWave-X &91.3 &99.6 &99.6 &1.34 &5.47 &5.17 \\
         uWave-Z &85.8 &99.9 &99.9 &1.33 &7.06 &7.12 \\
         Wafer &99.8 &100.0 &100.0 &0.88 &1.93 &1.94 \\
         Yoga &72.4 &99.9 &99.8 &1.04 &1.98 &1.96 \\
         \midrule
         Overall &85.9  &98.7  &\textbf{98.0}  &\textbf{1.09}  &4.32  &4.28  \\
        \bottomrule
     \end{tabular}}
     \vspace{-1em}
 \end{table}

In \autoref{tab:ucr_adversarial}, we present a comprehensive comparison between CP, RSCP and TSCP across various datasets from the UCR archive (complete version in \autoref{app:adversarial_attack}). 
This assessment is limited to models that demonstrate a clean test top-1 accuracy of at least 70\%.
We highlight the method whose coverage is as close as possible to the target of $90\%$ ($1-\alpha$) and whose set-size is as close as possible to the target of 1.
Notably, we prioritize methods that consistently achieve at least the desired coverage of 90\%. 
This minimum threshold ensures a baseline level of robustness, as methods falling below this level are considered less reliable in the face of adversarial manipulation.
The \textit{Overall} row reports the average performance and provides an aggregated view of the adversarial robustness across all datasets.
In general, the performance of TSCP and RSCP is close.

We observe that, at the cost of a larger set-size, both RSCP and TSCP successfully create conformal sets that adapt to changes in the data distribution and maintain a coverage level above the target of 90\%.
For some of the datasets, both methods achieve 100\% coverage. 
This might be attributed to the relatively small size of the validation and calibration sets (20\% of the dataset each). 
Few data points in the calibration and validation sets lead to overly cautious decision thresholds and fluctuations in coverage estimates.
Therefore, while we observe 100\% coverage in many cases, the actual coverage on unseen data might be slightly lower.
Importantly, TSCP tends to produce smaller prediction sets than RSCP while still ensuring high coverage, which aligns more closely with our overall goal.
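
To make the effect of a small calibration set concrete, the following sketch assumes standard split conformal calibration with $n$ exchangeable calibration scores. The calibration procedure is not detailed in this section, and $n = 20$ is a hypothetical value chosen only for illustration. Under this assumption, the threshold is the empirical quantile of the calibration scores at level
\begin{equation}
    \frac{\lceil (n+1)(1-\alpha) \rceil}{n}, \qquad \textrm{e.g.} \quad \frac{\lceil 21 \cdot 0.9 \rceil}{20} = \frac{19}{20} = 0.95 \quad \textrm{for } n = 20,\ \alpha = 0.1,
\end{equation}
which is stricter than the nominal level of $0.9$. Moreover, for continuous scores, the coverage conditional on the calibration set follows a $\mathrm{Beta}(n + 1 - l,\, l)$ distribution with $l = \lfloor (n+1)\alpha \rfloor$. For $n = 20$ this is $\mathrm{Beta}(19, 2)$, with mean $19/21 \approx 0.905$ and standard deviation of approximately $0.063$. Both the conservativeness of the threshold and the spread of the coverage shrink as $n$ grows.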


\subsection{Domain Generalization in Vehicle Sensor Data}\label{exp:domain_generalization}
% Generalized perturbations 

\begin{figure}
    \centering
    \includegraphics[width=0.45\textwidth]{figs/domain generalization/CP_RSCP_TSCP.pdf}
    \vspace{-0.5em}
    \caption{Coverage comparison between temporal smooth and plain CP in a domain generalization context.}
    \label{fig:domain_generalization}
    \vspace{-1em}
\end{figure}


Here, we conducted a comparative analysis of CP, RSCP and TSCP using our internal dataset derived from vehicle sensor data.
In this context, we train the time series transformer using portions of data from each distinct domain, each characterized by unique recording configurations.
Both the calibration and test sets consist of one or more configurations, each containing a minimum of 2\,000 data points. 
The training set encompasses the rest, amounting to a total of 32\,000 data points.


In \autoref{fig:domain_generalization}, we plot the accuracy, coverage and set-size for each specific test domain (configuration).
In this context, we consider a temporal-warping transformation ($\sigma=0.2$) for TSCP and jitter (Gaussian noise with $\sigma=0.2$) for RSCP.
A dashed gray line marks the target coverage of 0.9 ($\alpha=0.1$) and the target set-size of 1. 
We observe that CP and TSCP demonstrate similar performance, whereas RSCP lags in certain configurations. 
Overall, for CP and TSCP, only 2 out of 19 configurations fall below the 0.9 threshold. 
The fraction of configurations below the target, $2/19 \approx 0.105$, is thus close to the nominal miscoverage level $\alpha = 0.1$, which indicates that coverage in domain generalization is near its target.




In \autoref{tab:domain_generalization_plain_tests}, we report the average domain generalization performance of CP, RSCP and TSCP across different transformations. 
Interestingly, the coverage is fairly uniform across most transformations (between 91.3\% and 94.2\%), which illustrates that the prediction sets remain reliable despite the diversity introduced by the transformations; the exception is window-warp, whose coverage (85.6\%) falls below the 90\% target. 
The accuracy, however, varies among transformations, with window-warp showing the largest decrease (62.5\% compared to 83.0\% for vanilla CP).
This suggests that certain transformations can introduce complexities that challenge the classifier's ability to generalize. 
The set-size is consistent across methods (between 1.28 and 1.37); RSCP yields the smallest value, yet it remains in line with the results of the other methods.
With these results, we highlight the effects of the induced transformations on classifier performance: while some transformations preserve generalization, others introduce challenges that impact accuracy and the certainty of predictions.

\begin{table}[htb]
     \centering
      \caption{Comparison of domain generalization performance between vanilla and $\pi$-smoothed classifiers across various transformations ($\sigma=0.2$) in terms of accuracy (top-1), coverage and set-size. The values presented are averages calculated over the plain test sets of each individual configuration.}
    \label{tab:domain_generalization_plain_tests}
    \vspace{-0.5em}
     \adjustbox{width=0.48\textwidth}{%
     \begin{tabular}{llrrrrr}
         \toprule
         Method &Transform. & Acc. & Coverage & Set-Size \\
         \midrule
         CP &Vanilla & 83.0 & 94.2 & 1.35 \\
         RSCP &Jitter & 80.5 & 93.0 & 1.28 \\
         TSCP &Scaling & 83.1 & 93.8 & 1.33 \\
         TSCP &Magnitude-Warp & 73.2 & 91.3 & 1.33 \\
         TSCP &Time-Warp & 74.3 & 91.9 & 1.32 \\
         TSCP &Window-Warp & 62.5 & 85.6 & 1.37 \\
         \bottomrule
     \end{tabular}}
     \vspace{-0.5em}
 \end{table}



\subsection{Discussion of Results}

Our analysis, spanning adversarial robustness and domain generalization, reveals notable insights into the performance of the conformal methods CP, RSCP, and TSCP. 
In adversarial settings, RSCP and TSCP maintain higher coverage than CP under increasing adversarial perturbations, a practical demonstration of their robustness. 
This robustness comes at the cost of a rise in set-size, that is, less precise predictions, while accuracy and coverage are still preserved.
TSCP performs on par with RSCP, with an overall coverage of 98.0\% and set-size of 4.28 compared to 98.7\% and 4.32 for RSCP.
Furthermore, in the real-world application of vehicle sensor data, the experiments show that temporal transformations attain coverage close to the target in domain generalization, although the accuracy decreases with stronger transformations such as window-warping, pointing to challenges in this area.
Collectively, these findings demonstrate the potential of introducing native time series augmentation in environments susceptible to domain shifts and highlight the challenges in enhancing classifier robustness and accuracy across diverse configurations.