In this section, we present a UCB-type algorithm termed $\text{\ouralg}$ as illustrated in \Cref{alg:qmc-kernel-ucb} 
for the quantum bandit problem with a kernelized reward function.
We also establish a confidence interval for our reward estimator (Proposition \ref{prop:conf-bd}).
\subsection{Proposed Method}
\strevision{To leverage the quadratic speedup of the QMC method (Lemma \ref{lem:qmc}),
we divide the time horizon into several stages, in a manner similar to the doubling trick (cf.\ \cite[Chapter 6]{lattimore2020bandit}).
For each stage $s = 1,2, \dots$, \ouralg{} selects an action $x_s \in \cX$,
calls the QMC method $\qmc(\cO_{x_s}, \tradeoff \epsilon_s, \delta/\nstin)$ with error tolerance $\tradeoff \epsilon_s$,
and observes its output $y_s$,
where $x_s$ is an ``optimistic estimate'' of the best action $x^\star$, and $\nstin, \tradeoff$ are parameters of \ouralg{}.
We explain how to select the error $\epsilon_s$ and the action $x_s$ below (and discuss the parameter $\tradeoff$ in \Cref{sec:tradeoff-parameter}).
Since the QMC method calls the quantum reward oracles $\cO_{x_s}, \cO_{x_s}^\dagger$ $\frac{\qmcubc}{\tradeoff\epsilon_s} \log(\nstin/\delta)$ times,
the algorithm plays the same action $x_s$ for $\frac{\qmcubc}{\tradeoff\epsilon_s} \log(\nstin/\delta)$ successive rounds in stage $s$.
In accordance with the problem setting, the algorithm terminates once it has consumed $T$ oracle queries.
}


% At the beginning of each stage $s=1, 2, \dots$, \ouralg{} computes an action $x_s \in \cX$,
% calls $\qmc(\cO_{x_s}, \tradeoff \epsilon_s, \delta/\nstin)$, and observes an output $y_s$ of the QMC method,
% where $\nstin, \tradeoff$ are parameters of \ouralg{}
% and we explain how to select the error $\epsilon_s$ and the action $x_s$ below.
% Since the QMC method queries the quantum reward oracles $\cO_{x_s}, \cO_{x_s}^\dag$ at most $\frac{\qmcubc}{\tradeoff\epsilon_s} \log(\nstin/\delta)$ times,
% in each stage $s$,
% \ouralg{} plays the same action for successive $\frac{\qmcubc}{\tradeoff\epsilon_s} \log(\nstin/\delta)$ rounds in stage $s$
% and terminates if it uses $T$ queries.

Because an output $y_s$ with a small estimation error $\epsilon_s$ is more informative than one with a larger error,
we consider the following weighted least-squares estimate of the ground-truth vector $\theta^* \in\cH_k$ with weights $1/\epsilon_i^2$:
\begin{align}\label{eq:lse}
    \hat{\theta}_s \in \underset{\theta\in\cH_k}{\argmin} \,\sum_{i=1}^{s} \frac{1}{\epsilon_i^2} \left( \phi(x_i)^\top\theta - y_i \right)^2 + \reg \|\theta\|_{\mathcal{H}_k}^2,
    % \hat{\theta}_s = \argmin_{\theta\in\cH_k}\,\sum_{i=1}^{s} \frac{1}{\epsilon_i^2} \left( \phi(x_i)^\top\theta - y_s \right)^2 + \reg \|\theta\|_{\mathcal{H}_k}^2,
\end{align}
where $\epsilon_i = \| \phi(x_i)\|_{V_{i-1}^{-1}}$ for $1 \le i \le s$,
$\reg >0$ is a regularizing parameter,
and $V_s: \cH_k \rightarrow \cH_k$ is a positive-definite operator defined as
\begin{align*}
    V_s = \reg I_{\cH_k} + \sum_{i=1}^{s} \frac{1}{\epsilon_i^2} \phi(x_i)\phi(x_i)^\top = \reg I_{\cH_k} + \Phi_s^\top W_s \Phi_s,
\end{align*}
where $I_{\cH_k}$ is the identity operator on $\cH_k$, and $\Phi_s$, $Y_s$, and $W_s$ are defined as follows:
\begin{math}
        \Phi_s = \left( \phi(x_1), \phi(x_2), \dots, \phi(x_s) \right)^\top, 
\end{math}
\begin{math}
        Y_s = (y_1, y_2, \dots, y_s)^\top,
\end{math}
\begin{math}
        W_s = \diag\left( 1/\epsilon^2_1, 1/\epsilon^2_2, \dots, 1/\epsilon^2_s \right).
\end{math}
Note that the weighted least-squares estimator \eqref{eq:lse} admits the closed form $\hat{\theta}_s = V^{-1}_s \Phi^{\top}_{s} W_s Y_s$.
As in the linear case \citep{wan2023quantum}, 
the weighted least-squares estimator is the key ingredient that allows the algorithm to achieve an $O(\text{poly}(\log T))$ regret bound (in the case of exponential eigendecay).
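
For completeness, the closed form of $\hat{\theta}_s$ can be checked from the first-order optimality condition of \eqref{eq:lse}; the following is a formal sketch that treats $\Phi_s$ as a linear map from $\cH_k$ to $\RR^s$.
Setting the derivative of the objective with respect to $\theta$ to zero gives
\begin{align*}
    \sum_{i=1}^{s} \frac{1}{\epsilon_i^2}\, \phi(x_i)\left( \phi(x_i)^\top\theta - y_i \right) + \reg\, \theta = 0
    \quad\Longleftrightarrow\quad
    V_s\, \theta = \Phi_s^\top W_s Y_s .
\end{align*}
Since $V_s$ is positive definite (hence invertible) and the objective is strictly convex for $\reg > 0$, the unique minimizer is $\hat{\theta}_s = V_s^{-1}\Phi_s^\top W_s Y_s$.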

As previously mentioned, Algorithm \ref{alg:qmc-kernel-ucb} is a UCB-type algorithm, so
we need to compute an estimate $\muw_{s}(x) := \phi(x)^\trn \thetahat_{s}$ of $\mu(x)$
and its estimation error $\sigmaw_s(x) := \| \phi(x)\|_{V_{s}^{-1}}$ for each $x \in \cX$.
However, computing $\muw_{s}(x)$ and $\sigmaw_s(x)$ naively requires 
the inverse of the linear operator $V_s$ defined on $\cH_k$, 
which is potentially infinite-dimensional.
It is well known that in the unweighted (and classical) case \citep{valko2013finite,srinivas2010gaussian},
one can compute estimates of $\mu(x)$ and their estimation errors using only kernel evaluations and 
finite-dimensional linear algebra (i.e., the kernel trick), owing to the reproducing property of the RKHS.
The following proposition extends this well-known result to the weighted case. 
\begin{prop}[cf.\ \cite{dai2023quantum}, Sec.~4.1]
    \label{prop:mu-sigma-kernel-trick}
    For $s \in \ZZ_{\ge 1}$ and $x \in \cX$, 
    we define $\muw_s(x) = \phi(x)^\trn \thetahat_{s}$ and $\sigmaw_s(x) := \| \phi(x)\|_{V_{s}^{-1}}$.
    We also define a matrix $K_s \in \RR^{s \times s}$ and a column vector $k_s(x) \in \RR^{s}$
    by $(K_s)_{ij} = k(x_i, x_j)$ and
    $(k_s(x))_{i} = k(x, x_i)$ for $1 \le i, j \le s$.
    Then the following identities hold, where $I_s$ denotes the $s \times s$ identity matrix:
    \begin{align*}
        \muw_s(x) &=  k_s(x)^\trn (\reg I_s + W_s    K_s)^{-1}W_s Y_s ,\\
        \reg\sigmaw_s^2(x) &= k(x, x) - k_s(x)^\trn (\reg I_s + W_s K_s)^{-1} W_s k_s(x).
    \end{align*}
\end{prop}
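The identities in Proposition \ref{prop:mu-sigma-kernel-trick} can be verified by a short calculation. The following is only a sketch: it manipulates operators on $\cH_k$ formally and uses nothing beyond the definition of $V_s$ and the closed form $\thetahat_s = V_s^{-1}\Phi_s^\top W_s Y_s$.
By the reproducing property, $K_s = \Phi_s \Phi_s^\top$, $k_s(x) = \Phi_s \phi(x)$, and $k(x, x) = \phi(x)^\trn \phi(x)$.
From the definition of $V_s$ we obtain $V_s \Phi_s^\top = \Phi_s^\top (\reg I_s + W_s K_s)$, and the matrix $\reg I_s + W_s K_s$ is invertible because $W_s K_s$ is similar to the positive semi-definite matrix $W_s^{1/2} K_s W_s^{1/2}$. Hence
\begin{align*}
    V_s^{-1} \Phi_s^\top = \Phi_s^\top (\reg I_s + W_s K_s)^{-1},
    \qquad
    \muw_s(x) = \phi(x)^\trn V_s^{-1} \Phi_s^\top W_s Y_s = k_s(x)^\trn (\reg I_s + W_s K_s)^{-1} W_s Y_s .
\end{align*}
For the second identity, let $I$ denote the identity on $\cH_k$. Since $V_s - \Phi_s^\top W_s \Phi_s = \reg I$, we have
\begin{align*}
    \reg V_s^{-1} = I - V_s^{-1} \Phi_s^\top W_s \Phi_s = I - \Phi_s^\top (\reg I_s + W_s K_s)^{-1} W_s \Phi_s ,
\end{align*}
and multiplying by $\phi(x)^\trn$ on the left and $\phi(x)$ on the right yields
$\reg \sigmaw_s^2(x) = k(x, x) - k_s(x)^\trn (\reg I_s + W_s K_s)^{-1} W_s k_s(x)$.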
\subsection{Confidence Interval}
The following result provides a confidence interval for the estimate $\muw_s(x)$.
\begin{prop}
    \label{prop:conf-bd}
    Let $\totnst$ be the total number of stages of Algorithm \ref{alg:qmc-kernel-ucb}
    and let $x_s$ be the action selected by Algorithm \ref{alg:qmc-kernel-ucb} 
    in stage $s$.
    Assume that $\nstin \ge \totnst$, where $\nstin$ is the input parameter of  
    Algorithm \ref{alg:qmc-kernel-ucb}.
    Then, with probability at least $1-\delta$, 
    the following inequality holds for all $s=1,\dots, \totnst$ and $x\in \cX$:
    \begin{equation*}
        \left| \mu(x) - \muw_s(x)\right| \le \beta_s \sigmaw_s(x).
    \end{equation*}
    Here, $\beta_s =\sqrt{\reg}S + \tradeoff \sqrt{s}$, where $S$ is an upper bound on $\|\theta^*\|_{\cH_k}$. 
\end{prop}
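To give some intuition for the two terms in $\beta_s$, we include a heuristic sketch. It assumes that $\mu(x) = \phi(x)^\trn \theta^*$ and that Lemma \ref{lem:qmc} guarantees an additive error of at most the requested tolerance with the requested probability; it treats operators on $\cH_k$ formally and is not meant to reproduce the formal proof.
By a union bound over at most $\nstin$ stages, with probability at least $1-\delta$ the errors $\xi_i := y_i - \mu(x_i)$ satisfy $|\xi_i| \le \tradeoff \epsilon_i$ for all $i$.
On this event, writing $\xi = (\xi_1, \dots, \xi_s)^\trn$ and substituting $Y_s = \Phi_s \theta^* + \xi$ into $\thetahat_s = V_s^{-1}\Phi_s^\top W_s Y_s$ gives
\begin{align*}
    \muw_s(x) - \mu(x) = -\reg\, \phi(x)^\trn V_s^{-1} \theta^* + \phi(x)^\trn V_s^{-1} \Phi_s^\top W_s \xi .
\end{align*}
By the Cauchy--Schwarz inequality for the inner product induced by $V_s^{-1}$, the first term is at most $\reg\, \sigmaw_s(x) \|\theta^*\|_{V_s^{-1}} \le \sqrt{\reg} S\, \sigmaw_s(x)$, because $V_s \succeq \reg I$ with $I$ the identity on $\cH_k$.
For the second term, set $u = W_s^{1/2} \xi$ (so that $|u_i| \le \tradeoff$) and $A = W_s^{1/2} \Phi_s$ (so that $V_s = \reg I + A^\top A$). Then
\begin{align*}
    \left\| \Phi_s^\top W_s \xi \right\|_{V_s^{-1}}^2 = u^\trn A V_s^{-1} A^\top u \le \|u\|_2^2 \le \tradeoff^2 s ,
\end{align*}
since $A V_s^{-1} A^\top = A A^\top (\reg I_s + A A^\top)^{-1}$ has all eigenvalues in $[0, 1)$.
Combining the two bounds gives $\left| \mu(x) - \muw_s(x) \right| \le (\sqrt{\reg} S + \tradeoff \sqrt{s})\, \sigmaw_s(x)$.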
In the linear case,
\cite{wan2023quantum} proved a similar result,
in which $\beta_s$ is $O(\sqrt{ds})$, where $d$ is the dimension of the linear model.
However, since $\dim \cH_k$ is possibly infinite, 
their result is vacuous in our setting.

Although the proof is quite different, 
we note that Proposition \ref{prop:conf-bd} resembles the known confidence interval in the classical setting
\citep{srinivas2010gaussian}.
In the classical setting, 
it is well known that
a confidence interval of the form $|\mu_t(x) - \mu(x)| = O(\sqrt{\gamma_T}\sigma_t(x))$ holds \citep{srinivas2010gaussian,chowdhury2017kernelized},
where $\mu_t(x)$ and $\sigma_t^2(x)$ are the posterior mean and posterior variance in the classical setting,
and $\gamma_T$ is the maximum information gain.
By Proposition \ref{prop:conf-bd}, 
we see that $ \left| \mu(x) - \muw_s(x)\right| = O(\sqrt{\totnst}\, \sigmaw_s(x)) $, and
as we shall see in Sec.~\ref{sec:main-results}, 
the total number $\totnst$ of stages plays a role similar to that of the maximum information gain $\gamma_T$.

\begin{algorithm}[t]
    \caption{\ouralg}
    \begin{algorithmic}[1]
        \renewcommand{\algorithmicrequire}{\textbf{Inputs}:}
        \REQUIRE failure probability $\delta \in (0, 1)$, the total number of rounds $T$, 
        an upper bound $\nstin$ on the total number of stages, and a tradeoff parameter $\tradeoff > 0$.
        \FOR{each stage $s=1,2,\dots$ (terminate once $T$ queries to the oracles $\cO_x, \cO_x^\dag$ have been used)}
        \STATE $x_s \gets \argmax_{x \in \cX} \left( \muw_{s-1}(x) + \beta_{s-1} \sigmaw_{s-1} (x) \right)$.
        \STATE $\epsilon_s \gets \sigmaw_{s-1}(x_s)$.
        \STATE Run $\text{QMC}(\mathcal{O}_{x_s},\tradeoff\epsilon_s,\frac{\delta}{\nstin})$ and obtain its output $y_s$.
        \FOR{the next $\frac{\qmcubc}{\tradeoff\epsilon_s}\log \frac{\nstin}{\delta}$ rounds}
        \STATE Play action $x_s$ and incur regret $\mu(x^\star) - \mu(x_s)$.
        \ENDFOR
        \ENDFOR
    \end{algorithmic} 
    \label{alg:qmc-kernel-ucb}
\end{algorithm}

\subsection{Tradeoff Parameter}
\label{sec:tradeoff-parameter}
Both our algorithm (Algorithm \ref{alg:qmc-kernel-ucb}) and Q-GP-UCB \citep{dai2023quantum} 
are UCB-type algorithms that extend QLinUCB \citep{wan2023quantum} to the kernelized case.
However, we introduce a novel tradeoff parameter $\tradeoff$ that balances the total number of stages 
against the regret incurred in each stage. 
Since we call the reward oracles $O(\frac{1}{\tradeoff\epsilon_s})$ times in stage $s$, a larger $\tradeoff$ 
makes the regret incurred in each stage smaller,
but makes the total number of stages larger.
We detail the dependence of the cumulative regret on the parameter $\tradeoff$ in Proposition \ref{prop:regret-using-m}.
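
To make the first half of this tradeoff concrete, the following back-of-the-envelope calculation bounds the regret of a single stage $s \ge 2$. It is only a sketch: it assumes that the event of Proposition \ref{prop:conf-bd} holds and uses the standard optimism argument for UCB-type algorithms.
Since $x_s$ maximizes the upper confidence bound and $\epsilon_s = \sigmaw_{s-1}(x_s)$,
\begin{align*}
    \mu(x^\star) - \mu(x_s)
    \le \muw_{s-1}(x_s) + \beta_{s-1} \sigmaw_{s-1}(x_s) - \mu(x_s)
    \le 2 \beta_{s-1} \sigmaw_{s-1}(x_s)
    = 2 \beta_{s-1} \epsilon_s .
\end{align*}
Multiplying by the number of rounds in stage $s$, namely $\frac{\qmcubc}{\tradeoff \epsilon_s} \log \frac{\nstin}{\delta}$, the regret incurred in stage $s$ is at most
\begin{align*}
    \frac{2 \qmcubc\, \beta_{s-1}}{\tradeoff} \log \frac{\nstin}{\delta}
    = 2 \qmcubc \left( \frac{\sqrt{\reg} S}{\tradeoff} + \sqrt{s-1} \right) \log \frac{\nstin}{\delta} ,
\end{align*}
which does not depend on $\epsilon_s$ and decreases as $\tradeoff$ grows.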


\begin{comment}
\subsection{Comparison to \cite{dai2023quantum}}
\label{subsec:method-comparison-to-qbo}
\cite{dai2023quantum} provided a similar confidence interval to Proposition \ref{prop:conf-bd} under the following assumption.
We discuss the validity of this assumption below.
\begin{assump}[Subgaussian Error Assumption of QMC]
    \label{assump:subgaussian-error}
    Let $y$ be a random variable taking values in $[0, 1]$
    and $\cO(y)$ the unitary operator corresponding to $y$ as in Lemma \ref{lem:qmc}.
    Let $\widehat{y}$ be an output of the QMC method $\qmc(\cO(y), \epsilon, \delta)$ introduced in Lemma \ref{lem:qmc}.
    Then, the error $y - \widehat{y}$ is $\epsilon$-subgaussian.
\end{assump}
\cite{dai2023quantum} claimed that this assumption is assured by Lemma \ref{lem:qmc}, 
however, Lemma \ref{lem:qmc} only states that $|y - \widehat{y}|$ is bounded by $\epsilon$ with a high probability.
Noting that the subgaussian property implies that the error $y - \widehat{y}$ is unbiased,
their argument implies the QMC estimator is unbiased.

An implementation of the QMC method calls the quantum phase estimation algorithm repeatedly,
obtains estimated phases $\hat{\Theta}_1, \dots, \hat{\Theta}_n \in [0, 2\pi]$,
computes a median $\hat{\Theta} = \mathrm{Median}(\hat{\Theta}_1, \dots, \hat{\Theta}_n)$,
and outputs an estimation $(1 - \cos(\hat{\Theta}/2))/2$ of $\ex{y}$ (c.f., \cite{rebentrost2018quantum}).
Since each phase estimation $\hat{\Theta}_i$ includes an approximation error due to a finite number of qubits,
and the function $(1-\cos(x))/2$ is non-linear, to the best of our knowledge, there is no evidence 
that indicates the QMC estimator is unbiased.

\end{comment}