\section{Experiments}

The following experiments implement Adaptive Quantum Conformal Prediction (AQCP) (Algorithm \ref{alg:AQCP_batch_predict}) on a univariate multimodal regression task. We test local coverage properties of AQCP using data from the \(\mathtt{ibm\_sherbrooke}\) quantum processor. Additionally, we investigate the impact of the score function and the number of shots on prediction set size using data from a classical simulator.

Code and data for reproducing our experiments are available at \url{https://github.com/doug-spencer/AQCP}. All \(\mathtt{ibm\_sherbrooke}\) shot data were collected on 18 April 2025.

\subsection{Experimental Setup}

To facilitate comparison with \cite{QuantumCP}, we replicate the conditions of their regression task, in which training, calibration, and test data are drawn i.i.d.\ from the following distribution:
\[
    X \sim \mathcal{U}(-10, 10), \qquad
    Y \mid X = x \sim \frac{1}{2}\Big(\mathcal{N}(-\mu(x), 0.05^2) + \mathcal{N}(\mu(x), 0.05^2)\Big),
\]
where $\mathcal{U}(-10,10)$ denotes the uniform distribution on $[-10,10]$, and $\mathcal{N}(m,\sigma^2)$ denotes the univariate normal distribution with mean $m$ and variance $\sigma^2$. The function $\mu(x)$ is defined as
\[
    \mu(x) = \frac{1}{2}\sin\left(\frac{4}{5}x\right) + \frac{x}{20}.
\]
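As a quick consistency check on this target (a short derivation from the definitions above, not a result reported elsewhere), the symmetric mixture satisfies
\[
    \mathbb{E}[Y \mid X = x] = \tfrac{1}{2}\big(-\mu(x)\big) + \tfrac{1}{2}\,\mu(x) = 0, \qquad
    \operatorname{Var}(Y \mid X = x) = 0.05^2 + \mu(x)^2,
\]
so the conditional mean carries no information about $x$, and a useful predictor must capture both modes. Moreover, for $x \in [-10, 10]$,
\[
    |\mu(x)| \le \tfrac{1}{2} + \tfrac{10}{20} = 1,
\]
so both component means lie in $[-1, 1]$.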

\subsubsection{Quantum Model Architecture and Training}

To apply AQCP, a trained model must first be obtained. In the classical-data quantum-processing paradigm we follow, this requires specifying a parametrised quantum circuit (PQC), a classical data-encoding scheme, and an optimisation procedure. Background is provided in Section~\ref{sub: QML}; our approach parallels that of \cite{QuantumCP}.

The hardware-efficient ansatz (HEA) was selected for its versatility, problem-agnostic design, and hardware efficiency \citep{engineerQML}. We implemented the HEA using $Q=5$ qubits and $L=5$ layers applied sequentially to form the unitary operator
\[
U(\boldsymbol{\theta}) \coloneqq U_5(\boldsymbol{\theta}) \cdot U_{4}(\boldsymbol{\theta}) \cdot U_3(\boldsymbol{\theta}) \cdot U_{2}(\boldsymbol{\theta})\cdot U_1(\boldsymbol{\theta}).
\]
Each layer consists of an unparametrised entangling unitary, \(U_\text{ent}\), and parametrised single-qubit Pauli rotation gates:
\begin{align}
     U_l(\boldsymbol{\theta}) &\coloneqq U_\text{ent}\big(R_Z(\theta^1_{l,1})R_Y(\theta^2_{l,1})R_Z(\theta^3_{l,1}) \otimes \cdots \otimes R_Z(\theta^1_{l,Q})R_Y(\theta^2_{l,Q})R_Z(\theta^3_{l,Q}) \big),\quad l=1, \ldots,5,\nonumber\\[6pt]
     U_\text{ent} &\coloneqq \prod_{k=1}^{Q-1} C^Z_{k,k+1}.\nonumber
\end{align}
Here, \(C^Z_{k,k+1}\) denotes a controlled-\(Z\) gate between qubits \(k\) and \(k+1\), forming a linear entangling block (see Section~\ref{sub:ansatz}). Classical features were encoded using a learned non-linear angle encoding scheme with data re-uploading. Specifically, a neural network with architecture \((1,10,10,|\boldsymbol{\theta}|)\), where \(|\boldsymbol{\theta}|=3LQ=75\), maps each input \(x\) to the circuit rotation angles \(\boldsymbol{\theta}_{\boldsymbol W}(x)\). The network includes bias terms and uses Exponential Linear Unit (ELU) activations \citep{clevert2015fast}; \(\boldsymbol W\) denotes its trainable weights.
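
To make the size of the encoder concrete, the following count assumes fully connected layers with one bias per output unit. The layer widths and the presence of biases are stated above; the full connectivity is an illustrative assumption.
\[
    |\boldsymbol{\theta}| = 3LQ = 3 \cdot 5 \cdot 5 = 75, \qquad
    |\boldsymbol{W}| = \underbrace{(1 \cdot 10 + 10)}_{20} + \underbrace{(10 \cdot 10 + 10)}_{110} + \underbrace{(10 \cdot 75 + 75)}_{825} = 955.
\]
Under this assumption, all $955$ trainable parameters are classical: every rotation angle in the circuit is produced by the encoder, and the entangling block has no parameters.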

All PQC measurements were performed in the computational basis. Bitstring outcomes \(b\in\{0,1\}^Q\) were then mapped to a discrete real-valued grid via \(f: \{0,1\}^{Q} \rightarrow \mathbb{R}\), defined by
\[
f(b) = y_{\min} + k\cdot \mathrm{bin}(b), \quad\text{with} \quad k = \frac{y_{\max} - y_{\min}}{2^Q - 1}.
\]
Here, \(\mathrm{bin}(b)\) converts bitstrings to their denary representation, and \([y_{\min}, y_{\max}]=[-1.5, 1.5]\) was chosen to contain all but a negligible fraction of the probability mass of the target distribution.
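
As an illustrative calculation, assume that \(\mathrm{bin}\) reads the bitstring most-significant-bit first (the bit ordering is an assumption made here). With \(Q=5\), the grid has \(2^5 = 32\) points and spacing
\[
k = \frac{1.5 - (-1.5)}{2^5 - 1} = \frac{3}{31} \approx 0.0968,
\]
so that
\[
f(00000) = -1.5, \qquad f(10000) = -1.5 + 16 \cdot \tfrac{3}{31} \approx 0.0484, \qquad f(11111) = -1.5 + 31 \cdot \tfrac{3}{31} = 1.5.
\]
The spacing is roughly twice the component standard deviation \(0.05\) of the target distribution, so each mode is resolved by only a small number of grid points.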

We trained the angle-encoder parameters using the TorchQuantum framework \citep{hanruiwang2022quantumnas}, which integrates the construction and simulation of quantum circuits with PyTorch's automatic differentiation. Although gradients cannot be obtained this way on quantum hardware, alternatives such as the parameter-shift rule \citep{schuld2019evaluating,QuantumCP} are available.

TorchQuantum also provides direct access to the full measurement distribution of the PQC as a probability mass function over bitstrings, enabling the use of a multi-class cross-entropy loss. 
For a single training example $(x_i, y_i)$, this is defined as
\[
\ell\left(y_i, U(\boldsymbol{\theta}_{\boldsymbol{W}}(x_i))\right) 
\coloneqq -\log\left(\mathbb{P}(\hat{Y}=y_i \mid x_i)\right),
\]
where $\hat{Y}$ denotes the PQC measurement outcome. All training was performed on a noiseless simulator, 
so we do not condition on shot time in this instance.
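
For a sense of scale (a worked example, assuming natural logarithms), a circuit whose output distribution is uniform over the $2^Q = 32$ bitstrings would incur
\[
\ell = -\log\tfrac{1}{32} = 5\log 2 \approx 3.47
\]
on every training example. A per-example loss below this value therefore means that the model assigns more than uniform probability mass to the observed outcome.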

The model was trained to minimise the empirical risk,
\[
     \hat{R}(\boldsymbol{W};\mathcal{D}_{\mathrm{tr}}) \coloneqq \frac{1}{n_\text{tr}} \sum_{i=1}^{n_\text{tr}} \ell\left(y_i, U\big(\boldsymbol{\theta}_{\boldsymbol{W}}(x_i)\big)\right),
\]
with \(\mathcal{D}_{\text{tr}} = \{ (x_i, y_i) \}_{i=1}^{n_\text{tr}}\), $n_\text{tr}=1{,}000$, a fixed learning rate of \(0.01\), and \(100\) epochs. 

Figure~\ref{fig:shotview} shows a scatter plot of shots sampled from the trained model on a noiseless simulator and on the $\texttt{ibm\_sherbrooke}$ backend. The simulated samples indicate that the model reasonably approximates the conditional distribution. The samples from $\texttt{ibm\_sherbrooke}$ show a clear impact of hardware noise, although the underlying structure of the model remains discernible.

\begin{figure}
    \centering
    \includegraphics[width=\linewidth]{Figures/figure_sinusoidal_M100_AER_vs_IBMQ.png}
    \caption{\textbf{Regression model shots (simulated vs.\ \texttt{ibm\_sherbrooke}).} Comparison of $100{,}000$ shots sampled from each backend (Qiskit Aer simulator and \texttt{ibm\_sherbrooke}). Marker size is proportional to the number of overlapping shots at each location. The red lines show the component mean functions $\mu(x)$ and $-\mu(x)$.}
    \label{fig:shotview}
\end{figure}

\subsubsection{Algorithm Implementation and Evaluation Strategy}

To generate calibration and test data, we drew \(10{,}000\) samples from the target distribution. For each sample, we executed the circuit with the encoded parameters on \(\mathtt{ibm\_sherbrooke}\) for \(M=100\) shots. These \(10{,}000\) circuits were submitted to the device in batches of \(1{,}000\). 
For the efficiency study, we likewise collected shot data for $10{,}000$ samples, but from the classical simulator \(\mathtt{FakeQuitoV2}\) rather than from hardware. All data were collected through the Qiskit library \citep{qiskit}.

When implementing AQCP, we used Algorithm~\ref{alg:AQCP_batch_predict} with the optional update step included, so that conformity scores from each test point were appended to the calibration set after evaluation. All score functions were implemented as introduced in Section~\ref{sub: score functions}. For the \(\hat S_\text{k-NN}\) score function, we chose \(k=\lceil \sqrt{M} \,\rceil\), where \(M\) is the number of shots. This is in line with the choice in \cite{QuantumCP} and with the asymptotic bounds required for consistency. For the \(\hat S_\text{KDE}\) and \(\hat S_\text{HDR}\) score functions, we used a Gaussian kernel with the bandwidth chosen via Silverman's `rule of thumb'. When calculating scores, Gaussian noise with \(\sigma = 10^{-4}\) was added to break ties. Without this step, many scores would have been exactly equal, because the grid mapping produces only a discrete set of output values.
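
For concreteness, with \(M=100\) shots the neighbour count is \(k = \lceil \sqrt{100}\,\rceil = 10\). One common form of Silverman's rule for a Gaussian kernel, given here as an assumption since the exact variant is not specified above, sets the bandwidth to
\[
h = 0.9\,\min\!\Big(\hat\sigma,\ \frac{\mathrm{IQR}}{1.34}\Big)\, M^{-1/5},
\]
where \(\hat\sigma\) and \(\mathrm{IQR}\) denote the sample standard deviation and interquartile range of the \(M\) shot outcomes. For \(M=100\), \(M^{-1/5} = 10^{-2/5} \approx 0.398\), so \(h \approx 0.36\,\min(\hat\sigma, \mathrm{IQR}/1.34)\).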

Coverage properties are evaluated using a rolling window of recent test points, which preserves sensitivity to transient undercoverage that may be obscured by long-run averages. To evaluate the effect of adaptivity, we compare AQCP with its zero-step-size counterpart $(\gamma = 0)$. This corresponds to applying QCP with an expanding calibration dataset and no feedback adjustment, which we refer to as \emph{online QCP}.
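
To illustrate the feedback adjustment, suppose it takes the standard Adaptive Conformal Inference form. This is an assumption made for illustration; Algorithm~\ref{alg:AQCP_batch_predict} gives the exact update.
\[
\alpha_{t+1} = \alpha_t + \gamma\,(\alpha - \mathrm{err}_t), \qquad \mathrm{err}_t = \mathbf{1}\{y_t \notin C_t(x_t)\},
\]
where \(C_t\) denotes the prediction set issued at step \(t\), a notation introduced only for this example. With \(\alpha = 0.1\) and \(\gamma = 0.03\), a covered point raises the working level by \(0.03 \cdot 0.1 = 0.003\), while a miss lowers it by \(0.03 \cdot 0.9 = 0.027\). These increments balance exactly when misses occur \(10\%\) of the time, since \(0.9 \cdot 0.003 = 0.1 \cdot 0.027\). Setting \(\gamma = 0\) freezes \(\alpha_t\) at its initial value, which recovers online QCP.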

In the efficiency study, we consider only AQCP with a positive step size. To provide a notion of optimality, we define $\gC^*$ as the class of prediction sets that minimise the expected prediction set size subject to the marginal coverage constraint. Further details on this optimality criterion are given in Appendix~\ref{app: optimality S1}.


\subsection{Local Coverage Results}
\label{sec:local coverage}
We now demonstrate the efficacy of the adaptive recalibration mechanism in AQCP. Figures~\ref{fig:knn_ibmq} and~\ref{fig:Other_coverage} show the local coverage, computed as a moving average over a stream of test points, for each of the score functions introduced in Section~\ref{sub: score functions}. The blue lines represent the online QCP algorithm (equivalent to AQCP with an adaptation step size of $\gamma=0$). The orange lines represent the AQCP algorithm with $\gamma=0.03$.
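
Concretely, writing \(w\) for the window length and \(C_s\) for the prediction set issued at test point \(s\) (notation introduced here for illustration), the moving-average coverage at test point \(t \ge w\) can be written as
\[
\widehat{\mathrm{Cov}}_t = \frac{1}{w} \sum_{s=t-w+1}^{t} \mathbf{1}\{y_s \in C_s(x_s)\}.
\]
As a rough guide to the scale of sampling noise, suppose coverage events were independent with probability \(0.9\), which is a simplifying assumption rather than a property of the data. A window of \(w = 500\) would then give a standard deviation of
\[
\sqrt{\frac{0.9 \cdot 0.1}{500}} \approx 0.013,
\]
so fluctuations of one to two percentage points around the target are expected even under ideal calibration.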
\begin{figure}[h]
    \centering
    \includegraphics[width=0.87\linewidth]{Figures/knn_ibmq.pdf}
    \caption{\textbf{Moving average coverage of AQCP (}$\gamma=0$, $0.03$\textbf{) on the multimodal regression task using shot data from \(\mathtt{ibm\_sherbrooke}\).} The k-NN score function. 100 initial calibration points are used with a rolling window of size \(500\), and a target miscoverage of $\alpha=0.1$.}
    \label{fig:knn_ibmq}
\end{figure}
\begin{figure}
    \centering

    \begin{subfigure}{0.87\linewidth}
        \centering
        \includegraphics[width=\linewidth]{Figures/dis_ibmq.pdf}
        \caption{Euclidean distance score function.}
    \end{subfigure}
    
    \begin{subfigure}{0.87\linewidth}
        \centering
        \includegraphics[width=\linewidth]{Figures/kde_ibmq.pdf}
        \caption{KDE score function.}
    \end{subfigure}

    \begin{subfigure}{0.87\linewidth}
        \centering
        \includegraphics[width=\linewidth]{Figures/hdm_ibmq.pdf}
        \caption{HDR score function.}
    \end{subfigure}
    \caption{\textbf{Moving average coverage of AQCP (}$\gamma=0$, $0.03$\textbf{) on the multimodal regression task using shot data from \(\mathtt{ibm\_sherbrooke}\).} Euclidean distance, KDE, and HDR score functions. 100 initial calibration points are used with a rolling window of size \(500\), and a target miscoverage of $\alpha=0.1$.}
    \label{fig:Other_coverage}
\end{figure}

Figure~\ref{fig:knn_ibmq} presents the results obtained using the k-NN score function. The online QCP baseline exhibits substantial deviations from the target coverage of $1-\alpha$. For example, it over-covers between test points \(2{,}000\) and \(3{,}000\), and later under-covers around test point \(8{,}500\). In contrast, AQCP shows greater stability. Once the initial rolling window is fully populated, AQCP consistently maintains the average coverage close to the desired $90\%$ target level. The algorithm's ability to dynamically adjust its miscoverage level allows it to counteract prolonged over- or under-coverage.


Similarly, Figure~\ref{fig:Other_coverage} demonstrates the stabilising behaviour of AQCP across the Euclidean distance ($\hat S_\text{Euc}$), kernel density estimation ($\hat S_\text{KDE}$), and high-density region ($\hat S_\text{HDR}$) score functions. In all cases, the AQCP algorithm ($\gamma=0.03$) maintains coverage closer to the nominal level than the online QCP algorithm ($\gamma=0$). This suggests that the choice of score function does not substantially influence the coverage stability achieved by AQCP.



\subsection{Efficiency Results}

We now focus on the efficiency of AQCP set predictors, specifically examining the impact of the score function and the number of shots $M$ on the average prediction set size. Since non-stationary noise is not directly relevant to this analysis, all shots were generated using the $\mathtt{FakeQuitoV2}$ backend. To isolate the effect of the score function and shot number, the step size is fixed at $\gamma = 0.03$. Results are presented as piecewise linear curves of the average set size across ten logarithmically spaced shot values ranging from $M=1$ to $M=1{,}000$. 
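
For reference, ten logarithmically spaced values between $1$ and $1{,}000$ are given by
\[
M_j = 10^{(j-1)/3}, \qquad j = 1, \ldots, 10,
\]
that is, $1,\ 2.15,\ 4.64,\ 10,\ 21.5,\ 46.4,\ 100,\ 215,\ 464,\ 1{,}000$ to three significant figures. Since shot counts are integers, the non-integer values must be rounded. Rounding to the nearest integer, which is an assumption since the convention is not stated here, gives $1,\ 2,\ 5,\ 10,\ 22,\ 46,\ 100,\ 215,\ 464,\ 1{,}000$.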

Figure~\ref{fig:efficiency}(a) presents the efficiency results for our multimodal regression task. All score functions perform similarly for small shot numbers ($M \leq 10$), after which their behaviours diverge. \sKde{} and \sHDR{} produce comparable average set sizes across all values of $M$, with both decreasing as $M$ increases on the logarithmic scale. \sHDR{} achieves the smallest average set size at $M = 1{,}000$. \sKde{} decreases more rapidly in the medium $M$ range but plateaus for $M \geq 100$. $\hat S_{\text{Euc}}$ and $\hat S_\text{1-NN}$ behave differently: although their performance is comparable at small shot numbers ($M \leq 10$), both produce substantially larger prediction sets over the rest of the range. At $M = 1{,}000$, $\hat S_{\text{Euc}}$ yields prediction sets approximately $1.5$--$2$ times larger than those produced by $\hat S_{\text{HDR}}$, despite both achieving the target coverage level of $90\%$, as shown in Figure~\ref{fig:efficiency}(b).
\begin{figure}[h]
    \centering
    \includegraphics[width=1\linewidth]{Figures/set_size_gamma_03.pdf}
    \caption{\textbf{Average coverage and average set size of AQCP (\(\gamma = 0.03\)), evaluated across a range of shot numbers \(M\) using shot data from \(\mathtt{FakeQuitoV2}\).}  
        The desired miscoverage is set to \(\alpha=0.1\). Averages are computed from prediction sets returned from Algorithm~\ref{alg:AQCP_batch_predict} with \(100\) initial calibration points and \(9{,}900\) test points. 
        (a) Average prediction set size for a range of score functions.  
        (b) Corresponding average coverage for the same score functions.  
        The optimal line represents the performance of the benchmark \(\gC^*\) family of prediction sets.}
    \label{fig:efficiency}
\end{figure}

\subsection{Limitations and Future Work}

The empirical evaluation of robustness to non-stationary noise is based on hardware runs collected over a single day and therefore does not fully characterise longer-term drift or more structured perturbations. While the observed behaviour is consistent with the theoretical motivation for adaptive calibration, a more comprehensive assessment across multiple days, together with evaluation under controlled noise perturbations, would provide a stronger empirical stress-test. 

The behaviour of AQCP would be further clarified by a systematic step-size sensitivity analysis, as well as the incorporation of step-size schedules proposed for Adaptive Conformal Inference in related literature \citep{podkopaev2024adaptive, ACIforTimeSeries, StonlgyAdaptiveCP} in place of the fixed step size used here. We view such extended evaluation as an important direction for future work. Furthermore, empirical evaluation of the method of \cite{beyondexchangeability} (described in Appendix~\ref{app:BeyondEX}) may provide additional insight into robustness beyond the exchangeable setting.

Finally, we note that AQCP is inherently sequential: it updates the miscoverage level using observed outcomes and thus requires that each test response be revealed before the next prediction is made. It is therefore not directly applicable in batch prediction settings. Developing methods that are valid under non-stationary quantum noise in batch settings remains an open direction for future work.




