\section{Background}

\subsection{Conformal Prediction}
\label{sec:CP_back}

There are two main variants of conformal prediction: full and split. Full conformal prediction is the original variant and is the most data-efficient \citep{vovk2005algorithmic, shafer2008tutorial,caprio-conformal-isipta}. Here we focus on split conformal prediction for its computational efficiency, and hereafter we refer to it simply as conformal prediction.

In split conformal prediction, prediction sets are constructed from a trained model, a calibration dataset $\mathcal{D}_\text{cal}=\{(x_i, y_i)\}_{i=1}^n$, and a test feature $x_{n+1}$. For each element of $\mathcal{D}_\text{cal}$, a real-valued score is computed, with higher scores assigned when the model's prediction conforms less to the target. Then, for each element $y$ of the target space, a candidate score is computed from $(x_{n+1}, y)$, and inclusion in the prediction set is determined by comparing this score to the calibration scores. Specifically, for a desired miscoverage rate $\alpha\in[0,1]$ and score function $\hat S : \mathcal X\times \mathcal Y\to \mathbb{R}$, the procedure is as follows:
\begin{enumerate}
    \item Compute the calibration score \(s_i = \hat S(x_i, y_i)\) for each calibration point \((x_i, y_i) \in \mathcal{D}_\text{cal}\).
    \item Set \(\lambda\) equal to the \(\lceil (n+1)(1-\alpha) \rceil\)-th smallest value among \(s_1, \ldots, s_{n}, +\infty\).
    \item For a given test input \(x_{n+1}\), construct the prediction set:
    \[
    C(x_{n+1}) \coloneqq \left\{ y \in \mathcal Y: \hat S(x_{n+1}, y) \le \lambda \right\}.
    \]
\end{enumerate}
In what follows, we use uppercase letters (e.g.,\ $X_i, Y_i, S_i$) to denote random variables and distinguish them from their realised values (e.g.,\ $x_i, y_i, s_i$).
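
As a purely illustrative instance of the three steps above (the numbers are hypothetical and not taken from any experiment), suppose $n = 9$, $\alpha = 0.2$, and the sorted calibration scores are
\[
    0.05,\ 0.11,\ 0.18,\ 0.24,\ 0.31,\ 0.40,\ 0.52,\ 0.67,\ 0.90.
\]
Then $\lceil (n+1)(1-\alpha) \rceil = \lceil 10 \times 0.8 \rceil = 8$, so $\lambda$ is the 8th smallest value among the nine scores and $+\infty$, namely $\lambda = 0.67$. A candidate label $y$ is included in $C(x_{10})$ exactly when $\hat S(x_{10}, y) \le 0.67$. With the same scores and $\alpha = 0.05$, the index becomes $\lceil 9.5 \rceil = 10$, so $\lambda = +\infty$ and the prediction set is the whole target space: nine calibration points are too few to certify $95\%$ coverage with a nontrivial set.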

Under the weak assumption that the calibration and test points are exchangeable, meaning that their joint distribution is invariant under permutations of the indices (see Section~\ref{sec:exchangeability}), the following marginal coverage guarantee holds.
\begin{theorem}[\cite{vovk2005algorithmic, lei2017distributionfree}]
\label{Thm:marginal_coverage}
    If the calibration points \((X_i, Y_i)\), \(i=1,\ldots,n\), and the test point \((X_{n+1}, Y_{n+1})\) are jointly exchangeable, then
    \begin{align*}
    \mathbb{P}(Y_{n+1}\in C(X_{n+1})) \geq 1-\alpha.
    \end{align*}
    Additionally, if the scores \(S_1, \ldots, S_{n}\) have a continuous joint distribution, then we have
    \begin{align*}
    \mathbb{P}(Y_{n+1}\in C(X_{n+1})) \leq 1-\alpha+\frac{1}{n+1}.
    \end{align*}
\end{theorem}
Here, the lower bound arises from \cite{vovk2005algorithmic} and the upper bound from \cite{lei2017distributionfree}.
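
To give a sense of scale, consider a worked instance of the theorem with values of $n$ and $\alpha$ chosen only for illustration. For $\alpha = 0.1$ and $n = 14$ calibration points, the two bounds read
\[
    0.9 \;\le\; \mathbb{P}(Y_{15}\in C(X_{15})) \;\le\; 0.9 + \tfrac{1}{15} \approx 0.967,
\]
whereas for $n = 999$ they tighten to
\[
    0.9 \;\le\; \mathbb{P}(Y_{1000}\in C(X_{1000})) \;\le\; 0.9 + \tfrac{1}{1000} = 0.901.
\]
In both cases the probability is marginal, that is, taken jointly over the calibration data and the test point, rather than conditional on a particular calibration set or test input.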


\subsubsection{The Role of Exchangeability in Split Conformal Prediction}
\label{sec:exchangeability}
Exchangeability is the cornerstone of conformal prediction. This section gives an informal proof of the lower bound of Theorem~\ref{Thm:marginal_coverage} in the case of almost surely distinct scores, with particular attention to the role of exchangeability.

A finite sequence of random variables, $Z_1, \ldots, Z_{n+1}$, is said to be exchangeable if their joint distribution is invariant under any permutation of the indices. Formally, for any permutation $\sigma \in \mathrm{Sym}(n+1)$ (the symmetric group on $n+1$ elements), we require that
\[
    (Z_1, \ldots, Z_{n+1}) \stackrel{d}{=} (Z_{\sigma(1)}, \ldots, Z_{\sigma(n+1)}),
\]
where $\stackrel{d}{=}$ denotes equality in distribution. This property is weaker than the i.i.d.\ assumption, yet it still implies that the order of the variables carries no statistical information.
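
A standard textbook example, not specific to this work, separates the two notions. Draw two balls without replacement from an urn containing one red and one blue ball, and let $Z_i = 1$ if the $i$-th ball drawn is red and $Z_i = 0$ otherwise. Then
\[
    \mathbb{P}(Z_1 = 1, Z_2 = 0) = \mathbb{P}(Z_1 = 0, Z_2 = 1) = \tfrac{1}{2},
\]
so swapping the indices leaves the joint distribution unchanged and $(Z_1, Z_2)$ is exchangeable. The variables are not independent, however, since $\mathbb{P}(Z_1 = 1, Z_2 = 1) = 0 \neq \tfrac{1}{4} = \mathbb{P}(Z_1 = 1)\,\mathbb{P}(Z_2 = 1)$.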

In the context of split conformal prediction, we consider a calibration dataset $\{(X_i, Y_i)\}_{i=1}^n$ and a new test point $(X_{n+1}, Y_{n+1})$, where $Y_{n+1}$ is unknown. If these $n+1$ pairs are exchangeable and we apply a fixed score function $\hat{S}(\cdot, \cdot)$ to each, then the resulting scores $S_1, \ldots, S_{n+1}$ are also exchangeable.

This preservation follows directly from \citet[Theorem~3]{kuchibhotla2020exchangeability}. The theorem shows that a transformation $G : (\mathcal X \times \mathcal Y)^{n+1} \to \mathbb{R}^{n+1}$ preserves exchangeability if it satisfies a permutation-equivariance condition: for any permutation $\pi_1 \in \mathrm{Sym}(n+1)$, there exists a corresponding permutation $\pi_2 \in \mathrm{Sym}(n+1)$ such that
\[
    \pi_1 G(w) = G(\pi_2 w), \quad \forall\, w \in (\mathcal X\times\mathcal Y)^{n+1}.
\]
In our setting, \(G\) corresponds to the map that assigns scores to data points, i.e.\ $G((X_i, Y_i)_{i=1}^{n+1}) = (\hat S(X_i,Y_i))_{i=1}^{n+1}$. Because $\hat S$ is applied identically to every calibration and test point, $G$ trivially satisfies the permutation condition, and thus the scores inherit exchangeability from the data.
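
The condition can be checked in one line. Assume the convention, suggested by the definition of exchangeability above, that a permutation $\pi$ acts on a tuple by $(\pi w)_i = w_{\pi(i)}$, and write $w_i = (x_i, y_i)$. Then, for every index $i$,
\[
    (\pi\, G(w))_i = G(w)_{\pi(i)} = \hat S(w_{\pi(i)}) = \hat S((\pi w)_i) = G(\pi w)_i,
\]
so the condition holds with $\pi_2 = \pi_1 = \pi$. The argument uses only that $\hat S$ is a fixed function that does not depend on the $n+1$ points it is applied to.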

The key insight is that exchangeability enables a probabilistic argument through ranking. Specifically, because the scores \((S_1, \ldots, S_{n+1})\) are exchangeable and, by assumption, almost surely distinct, the rank of \(S_{n+1}\) among them (its position when the scores are sorted in increasing order) is uniformly distributed on \(\{1, 2, \ldots, n+1\}\). Therefore,
\[
    \mathbb{P}(\operatorname{rank}(S_{n+1}) \leq \lceil(1-\alpha)(n+1)\rceil) = \frac{\lceil(1-\alpha)(n+1)\rceil}{n+1} \geq 1-\alpha.
\]
Let \(\tilde\lambda\) denote the \(\lceil (1-\alpha)(n+1) \rceil\)-th smallest value among \(\{S_1, \ldots, S_{n+1}\}\). We can rewrite the event
\[
    \{\operatorname{rank}(S_{n+1}) \le \lceil (1-\alpha)(n+1) \rceil\}
    =
    \{S_{n+1} \le \tilde\lambda\}.
\]
Furthermore, we can remove dependence on the test score by defining \(\lambda\) as the \(\lceil (1-\alpha)(n+1) \rceil\)-th smallest value among \(\{S_1, \ldots, S_n, +\infty\}\), and observing that
\[
    S_{n+1} \le \tilde\lambda
    \quad \Longleftrightarrow \quad
    S_{n+1} \le \lambda.
\]
Here, the forward implication follows from \(\tilde\lambda \le \lambda\), and the reverse from the fact that \(S_{n+1} > \tilde\lambda\) implies \(\lambda = \tilde\lambda\). Hence, for any miscoverage rate \(\alpha \in [0,1]\), we obtain
\[
    \mathbb{P}(S_{n+1} \le \lambda) \ge 1 - \alpha.
\]
Since $Y_{n+1} \in C(X_{n+1})$ holds if and only if $S_{n+1} \le \lambda$, defining $C(X_{n+1})$ as in Section~\ref{sec:CP_back} gives the lower bound of Theorem~\ref{Thm:marginal_coverage}. Intuitively, this guarantee holds because exchangeability ensures that the test point has no special status amongst the calibration data. For more in-depth treatments, see \cite{vovk2005algorithmic,gentleintrocp}.
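
A small numerical instance, with hypothetical scores chosen only for illustration, makes the equivalence between the two thresholds concrete. Let $n = 4$ and $\alpha = 0.25$, so that $\lceil (1-\alpha)(n+1) \rceil = \lceil 3.75 \rceil = 4$, and suppose the calibration scores are $0.1, 0.3, 0.5, 0.7$. Then $\lambda$, the 4th smallest value among $\{0.1, 0.3, 0.5, 0.7, +\infty\}$, equals $0.7$ regardless of the test score.
\begin{itemize}
    \item If $S_5 = 0.4$, the test score has rank $3$ and $\tilde\lambda = 0.5$, so both $S_5 \le \tilde\lambda$ and $S_5 \le \lambda$ hold.
    \item If $S_5 = 0.9$, the test score has rank $5$ and $\tilde\lambda = 0.7 = \lambda$, so both inequalities fail.
\end{itemize}
Since the rank of $S_5$ is uniform on $\{1, \ldots, 5\}$, the test score has rank at most $4$ with probability $4/5 = 0.8 \ge 1 - \alpha = 0.75$.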

\subsection{Quantum Machine Learning} \label{sub: QML}

In this work, we employ the classical-data quantum-processing (CQ) paradigm of quantum machine learning, as introduced in \cite{engineerQML}. In this paradigm, classical data are fed into a quantum model, which is trained using a classical optimiser. We use the term \emph{quantum model} to refer to a parametrised quantum circuit (PQC). A PQC applies a unitary transformation $U(\theta)$, dependent on a vector of tunable parameters $\theta$, to a quantum state that encodes a classical input $x$ \citep{engineerQML}.

For a working understanding of the CQ paradigm, two key components warrant further explanation: the design of the unitary transformation via the construction of a PQC, and the encoding of the classical input $x$ into the circuit, referred to as quantum data encoding \citep{rath2024quantum, schuld2021supervised}. For a more general introduction to the QML landscape, see \cite{chang2025primerquantummachinelearning}.


\subsubsection{PQC Ans\"atze}\label{sub:ansatz}

In quantum computing, an ansatz defines the structure of a quantum circuit by specifying both the set of gates used and their configuration. This selection is analogous to choosing a model architecture in classical machine learning, where the design profoundly influences the capability and efficiency of the model \citep{benedetti2019parameterized}. Current research efforts focus on identifying optimal ans\"atze for various applications, particularly in the context of variational quantum algorithms \citep{qin2023review}. The hardware-efficient ansatz is a subclass of ansatz designs that mitigates the gate overhead incurred during circuit compilation \citep{leone2024practical}. It does so by reducing idle qubits and employing entangling gates that are native to the hardware. This design is especially well suited to today's noisy intermediate-scale quantum (NISQ) devices, where circuit depth and fidelity are constrained.

The hardware-efficient ansatz is constructed as a sequence of layers, each consisting of local parametrised single-qubit gates applied in parallel to every qubit, followed by a fixed entangling-gate pattern applied between specified qubit pairs. In many implementations, the local gates are chosen to be rotations about the Pauli $X$, $Y$, and $Z$ axes, denoted $R_X(\theta)$, $R_Y(\theta)$, and $R_Z(\theta)$, respectively, where $\theta$ is the rotation angle. The entangling block is typically composed of a fixed pattern of CNOT (CX) or CZ gates that reflects the hardware's connectivity graph. Common entanglement schemes include linear (chain), circular, and all-to-all (full) connectivity \citep{engineerQML}. See Figure~\ref{fig: entangling configs} for circuit diagrams of these three entangling-block configurations, each implemented using CZ gates.
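
For concreteness, the gates can be written out explicitly. The matrices below assume the common convention $R_P(\theta) = e^{-i\theta P/2}$ for a Pauli operator $P$ and the computational basis ordering $\ket{00}, \ket{01}, \ket{10}, \ket{11}$; this convention is an assumption made here for illustration, and conventions vary across references.
\[
    R_X(\theta) =
    \begin{pmatrix}
        \cos\frac{\theta}{2} & -i\sin\frac{\theta}{2} \\
        -i\sin\frac{\theta}{2} & \cos\frac{\theta}{2}
    \end{pmatrix},
    \quad
    R_Y(\theta) =
    \begin{pmatrix}
        \cos\frac{\theta}{2} & -\sin\frac{\theta}{2} \\
        \sin\frac{\theta}{2} & \cos\frac{\theta}{2}
    \end{pmatrix},
    \quad
    R_Z(\theta) =
    \begin{pmatrix}
        e^{-i\theta/2} & 0 \\
        0 & e^{i\theta/2}
    \end{pmatrix},
\]
\[
    \mathrm{CZ} =
    \begin{pmatrix}
        1 & 0 & 0 & 0 \\
        0 & 1 & 0 & 0 \\
        0 & 0 & 1 & 0 \\
        0 & 0 & 0 & -1
    \end{pmatrix}.
\]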


\begin{figure}[h]
   \begin{subfigure}[b]{0.3\textwidth}
    \centering
    \makebox[\textwidth][c]{%
        \Qcircuit @C=1.5em @R=1.5em {
          & \ctrl{1}  & \qw       & \qw       & \qw \\
          & \ctrl{-1} & \ctrl{1}  & \qw       & \qw \\
          & \qw       & \ctrl{-1} & \ctrl{1}  & \qw \\
          & \qw       & \qw       & \ctrl{-1} & \qw \\
        }
    }
    \caption{Linear}
  \end{subfigure}
  \hfill
  \begin{subfigure}[b]{0.3\textwidth}
    \centering
    \makebox[\textwidth][c]{%
        \Qcircuit @C=1.5em @R=1.5em {
          & \ctrl{1}  & \qw       & \qw       & \ctrl{3}  & \qw \\
          & \ctrl{-1} & \ctrl{1}  & \qw       & \qw       & \qw \\
          & \qw       & \ctrl{-1} & \ctrl{1}  & \qw       & \qw \\
          & \qw       & \qw       & \ctrl{-1} & \ctrl{-3} & \qw \\
        }
    }
    \caption{Circular}
  \end{subfigure}
  \hfill
  \begin{subfigure}[b]{0.3\textwidth}
    \centering
    \makebox[\textwidth][c]{%
        \Qcircuit @C=1.5em @R=1.5em {
          & \ctrl{1}  & \qw       & \qw       & \ctrl{2}  & \qw       & \ctrl{3} & \qw \\
          & \ctrl{-1} & \ctrl{1}  & \qw       & \qw       & \ctrl{2}  & \qw       & \qw \\
          & \qw       & \ctrl{-1} & \ctrl{1}  & \ctrl{-2} & \qw       & \qw       & \qw \\
          & \qw       & \qw       & \ctrl{-1} & \qw       & \ctrl{-2} & \ctrl{-3} & \qw \\
        }
    }
    \caption{Full}
  \end{subfigure}
 
    \caption{Diagrams of three entangling-block configurations within a four-qubit circuit implemented using CZ gates: (a) a linear entangling block, (b) a circular entangling block, and (c) a full entangling block.}
    \label{fig: entangling configs}

\end{figure}
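
As a quick check on Figure~\ref{fig: entangling configs}, the number of two-qubit gates in one entangling block on $m \ge 3$ qubits (we write $m$ here to avoid a clash with the calibration-set size $n$) is
\[
    N_\text{linear} = m - 1, \qquad N_\text{circular} = m, \qquad N_\text{full} = \frac{m(m-1)}{2},
\]
which gives $3$, $4$, and $6$ gates for $m = 4$, matching the three panels. Because the CZ gate is diagonal in the computational basis and symmetric in its two qubits, all gates within such a block commute, so the order in which they are drawn does not affect the resulting unitary.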

\subsubsection{Angle Encoding}
\label{section:angle encoding}

In any supervised quantum machine learning algorithm, the classical data must be encoded into the quantum state prepared by the PQC. This encoding is an essential step in any CQ framework and can be achieved in several ways. Widely used strategies include basis encoding, which maps classical bits directly onto computational-basis states; amplitude encoding, which embeds a normalised feature vector into the probability amplitudes of a quantum state; and angle encoding, in which data values determine the rotation angles of quantum gates \citep{rath2024quantum, engineerQML}. 

We focus on angle encoding for its relevance to our experimental implementation and that of \cite{QuantumCP}. An angle encoder maps a classical feature vector to a set of rotation angles, each of which parametrises a single-qubit rotation gate within the PQC. In the simplest setting, a $d$-dimensional input $x = (x_1, \ldots, x_d)$ is encoded by applying single-qubit rotations on $d$ qubits, where the rotation angle assigned to the $j$-th qubit is determined by the corresponding feature $x_j$ (here the subscript indexes the components of a single input, not the calibration points of Section~\ref{sec:CP_back}). These local rotations prepare a product state $\ket{\psi(x)}$ in which each feature is stored in the state of its own qubit. Entanglement, if required, is typically introduced through subsequent entangling layers rather than the encoding itself. This embedding implicitly defines a quantum kernel, and thereby the feature space and expressivity of the quantum model \citep{schuld2021supervised}.
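
As an illustration, assume that the encoding rotation is $R_Y$, with the convention $R_Y(\theta) = e^{-i\theta Y/2}$, and that every qubit starts in $\ket{0}$; the text above does not fix the rotation axis, so this choice is made here only for the sake of example. A single feature $x_1$ then yields
\[
    R_Y(x_1)\ket{0} = \cos\frac{x_1}{2}\,\ket{0} + \sin\frac{x_1}{2}\,\ket{1},
\]
and a two-feature input $x = (x_1, x_2)$ gives the product state
\[
    \ket{\psi(x)} = \left(\cos\frac{x_1}{2}\,\ket{0} + \sin\frac{x_1}{2}\,\ket{1}\right) \otimes \left(\cos\frac{x_2}{2}\,\ket{0} + \sin\frac{x_2}{2}\,\ket{1}\right).
\]
The squared overlap between the encodings of two inputs $x$ and $x'$ is
\[
    \left|\langle \psi(x) | \psi(x') \rangle\right|^2 = \cos^2\!\left(\frac{x_1 - x_1'}{2}\right) \cos^2\!\left(\frac{x_2 - x_2'}{2}\right),
\]
a simple instance of the kernel induced by the embedding. Under this choice, inputs whose features differ by multiples of $2\pi$ have overlap one and are therefore indistinguishable.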


\subsection{Density Matrices and Noise Channels} 

To accurately describe quantum model operations and noise in the next section, we require the density-matrix and noise-channel formalisms, which we briefly introduce here. For a more comprehensive treatment, see \cite{keyl2002fundamentals, wilde2013quantum, fano1957description}.

A pure quantum state can be described in two equivalent ways: as a state vector $\ket{\psi}$ in a Hilbert space, or more generally as a density matrix $\rho$. While state vectors provide a convenient representation for pure states, density matrices extend this representation to mixed states, which are probabilistic mixtures of pure states. Formally, a density matrix is a positive semidefinite, Hermitian operator with unit trace,
\[
\rho = \sum_i p_i \ket{\psi_i}\bra{\psi_i}, \quad \sum_i p_i = 1,
\]
where $p_i \ge 0$ is the probability that the system is prepared in the pure state $\ket{\psi_i}$. This formalism is essential for modelling realistic quantum systems, as noise and decoherence inevitably lead to mixed states. In the context of QML, density matrices are particularly useful for analysing how data encoding and variational circuits interact with hardware noise.
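
Two single-qubit examples illustrate the distinction. Consider the pure state $\ket{+} = (\ket{0} + \ket{1})/\sqrt{2}$ and the equal mixture of $\ket{0}$ and $\ket{1}$:
\[
    \rho_\text{pure} = \ket{+}\bra{+} = \frac{1}{2}
    \begin{pmatrix}
        1 & 1 \\
        1 & 1
    \end{pmatrix},
    \qquad
    \rho_\text{mixed} = \frac{1}{2}\ket{0}\bra{0} + \frac{1}{2}\ket{1}\bra{1} = \frac{1}{2}
    \begin{pmatrix}
        1 & 0 \\
        0 & 1
    \end{pmatrix}.
\]
Both have unit trace and yield each computational-basis outcome with probability $1/2$, but they differ in their off-diagonal entries (the coherences). The purity $\operatorname{Tr}(\rho^2)$ distinguishes them: it equals $1$ for $\rho_\text{pure}$ and $1/2$ for $\rho_\text{mixed}$.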


In current quantum hardware, quantum states are unavoidably affected by noise processes such as decoherence, gate errors, and measurement imperfections. These noise processes can be modelled as quantum channels, mathematically described by completely positive trace-preserving (CPTP) maps acting on density matrices \citep[Section~8.3]{GeneralBackgroundQuantumInformation}. A quantum channel $\mathcal{E}$ transforms a state $\rho$ as
\[
\rho \mapsto \mathcal{E}(\rho) = \sum_k E_k \rho E_k^\dagger,
\]
where $\{E_k\}$ are Kraus operators satisfying $\sum_k E_k^\dagger E_k = I$, and $E_k^\dagger$ denotes the conjugate transpose of \(E_k\). This operator-sum representation, known as the Kraus representation, provides a powerful and general framework for capturing the effects of noise on quantum computations.
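
The completeness relation is precisely what makes the map trace preserving. By linearity and cyclicity of the trace,
\[
    \operatorname{Tr}\bigl(\mathcal{E}(\rho)\bigr) = \sum_k \operatorname{Tr}\bigl(E_k \rho E_k^\dagger\bigr) = \operatorname{Tr}\Bigl(\Bigl(\sum_k E_k^\dagger E_k\Bigr) \rho\Bigr) = \operatorname{Tr}(\rho) = 1.
\]
A noiseless gate is the special case of a single Kraus operator $E_1 = U$ with $U$ unitary, for which $\mathcal{E}(\rho) = U \rho U^\dagger$.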

Several standard noise channels are commonly used to model realistic quantum hardware \citep{GeneralBackgroundQuantumInformation}:
\begin{itemize}
    \item \textbf{Depolarising channel:} With probability $p$, the state is replaced by the maximally mixed state, modelling uniform random errors.
    \item \textbf{Phase flip channel:} With probability $p$, a Pauli-$Z$ (phase-flip) error is applied, capturing loss of coherence without affecting populations.
    \item \textbf{Amplitude damping channel:} Describes energy dissipation processes, such as spontaneous emission from an excited state to the ground state.
\end{itemize}
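
For a single qubit, one standard choice of Kraus operators for these channels is given below. Kraus representations are not unique, and the parametrisation by an error probability $p$ and a damping probability $\gamma$ is an assumption of this illustration rather than a convention fixed in the text.
\begin{align*}
    \text{Depolarising:} \quad & E_0 = \sqrt{1 - \tfrac{3p}{4}}\, I, \quad E_1 = \tfrac{\sqrt{p}}{2}\, X, \quad E_2 = \tfrac{\sqrt{p}}{2}\, Y, \quad E_3 = \tfrac{\sqrt{p}}{2}\, Z, \\
    \text{Phase flip:} \quad & E_0 = \sqrt{1-p}\, I, \quad E_1 = \sqrt{p}\, Z, \\
    \text{Amplitude damping:} \quad & E_0 =
    \begin{pmatrix}
        1 & 0 \\
        0 & \sqrt{1-\gamma}
    \end{pmatrix},
    \quad E_1 =
    \begin{pmatrix}
        0 & \sqrt{\gamma} \\
        0 & 0
    \end{pmatrix}.
\end{align*}
With these operators, the depolarising channel acts as $\mathcal{E}(\rho) = (1-p)\rho + p\, I/2$, the phase flip channel leaves the diagonal of $\rho$ unchanged and multiplies its off-diagonal entries by $1 - 2p$, and the amplitude damping channel transfers population from $\ket{1}$ to $\ket{0}$ with probability $\gamma$. Each set satisfies the completeness relation; for the phase flip channel, for example, $E_0^\dagger E_0 + E_1^\dagger E_1 = (1-p) I + p Z^2 = I$.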
