% \textcolor{red}{
% (Comment: It would be better to split up the research on non-linear extensions of the bandit problem in classical algorithms and the research on the bandit problem in quantum algorithms.)
% }
This section consists of two parts: a comparison of this study with \citet{dai2023quantum}, and a review of existing work on kernelized bandits and quantum online learning algorithms.
\subsection{Comparison to Quantum Bayesian Optimization}
\subsubsection{Bias of the QMC Estimator}
\label{subsec:method-comparison-to-qbo}
As stated in \Cref{sec:intro}, the present paper deals with the same problem setting as \citet{dai2023quantum}.
In particular, \citet{dai2023quantum} provided a confidence interval similar to that of \Cref{prop:conf-bd} under the following assumption.
We discuss the validity of this assumption below.
\begin{assump}[Subgaussian Error Assumption of QMC]
  % \label{assump:subgaussian-error}
  Let $y$ be a random variable taking values in $[0, 1]$
  and $\cO(y)$ the unitary operator corresponding to $y$ as in \Cref{lem:qmc}.
  Let $\widehat{y}$ be an output of the QMC method $\qmc(\cO(y), \epsilon, \delta)$ introduced in \Cref{lem:qmc}.
  Then, the error $y - \widehat{y}$ is $\epsilon$-subgaussian.
\end{assump}
\citet{dai2023quantum} claimed that this assumption is guaranteed by \Cref{lem:qmc}.
However, \Cref{lem:qmc} only states that $|y - \widehat{y}|$ is bounded by $\epsilon$ with high probability.
Since the subgaussian property implies that the error $y - \widehat{y}$ has zero mean,
their claim would imply that the QMC estimator is unbiased.
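
To make this implication explicit, the following short derivation may be helpful. It is a sketch that assumes the standard moment-generating-function definition of subgaussianity, which may differ from the one intended by \citet{dai2023quantum}: a random variable $e$ is $\sigma$-subgaussian if $\ex{\exp(\lambda e)} \le \exp(\lambda^2 \sigma^2 / 2)$ for every real $\lambda$. Here $e$ is a placeholder symbol, introduced only for this derivation, for the error $y - \widehat{y}$, and $\sigma = \epsilon$. By Jensen's inequality,
\[
  \exp\bigl(\lambda\, \ex{e}\bigr) \le \ex{\exp(\lambda e)} \le \exp\bigl(\lambda^2 \sigma^2 / 2\bigr),
  \quad \mbox{so} \quad
  \lambda\, \ex{e} \le \lambda^2 \sigma^2 / 2 \quad \mbox{for every real } \lambda .
\]
Dividing by $\lambda > 0$ and letting $\lambda \to 0$ gives $\ex{e} \le 0$; dividing by $\lambda < 0$ (which reverses the inequality) and letting $\lambda \to 0$ gives $\ex{e} \ge 0$. Hence $\ex{e} = 0$ under this definition.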

An implementation of the QMC method calls the quantum phase estimation algorithm repeatedly,
obtains estimated phases $\widehat{\Theta}_1, \dots, \widehat{\Theta}_n \in [0, 2\pi]$,
computes their median $\widehat{\Theta} = \mathrm{Median}(\widehat{\Theta}_1, \dots, \widehat{\Theta}_n)$,
and outputs an estimate $(1 - \cos(\widehat{\Theta}/2))/2$ of $\ex{y}$ (cf.\ \citealp{rebentrost2018quantum}).
Each phase estimate $\widehat{\Theta}_i$ includes an approximation error due to the finite number of qubits,
and the function $(1-\cos(x))/2$ is non-linear. To the best of our knowledge, there is therefore no evidence that the QMC estimator is unbiased.
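
The role of the non-linearity can be illustrated with a simplified error model. This is a hypothetical sketch and not a description of the actual phase estimation error: assume $\widehat{\Theta} = \theta + \xi$, where $\theta$ is the true phase and $\xi$ is a noise term (a symbol introduced only for this illustration) whose distribution is symmetric around zero, so that $\widehat{\Theta}$ itself is unbiased. Writing $g(x) = (1 - \cos(x/2))/2$ for the output map, the angle-addition formula and $\ex{\sin(\xi/2)} = 0$ give
\[
  \ex{g(\widehat{\Theta})} - g(\theta)
  = \frac{\cos(\theta/2)\,\bigl(1 - \ex{\cos(\xi/2)}\bigr)}{2} .
\]
Since $1 - \ex{\cos(\xi/2)} > 0$ whenever $\xi$ is not almost surely a multiple of $4\pi$, the output is biased in this model whenever $\cos(\theta/2) \neq 0$, even though the phase estimate is not.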

Although there are some recent methods for mitigating the bias of the QMC (or Quantum Amplitude) estimator,
to the best of our understanding, these improved methods are still biased and require a larger number of oracle queries
(see \citet{miyamoto2023bias} and references therein).


\subsubsection{Improved Regret Bounds}
\strevision{
As discussed in the introduction and in the remark after \Cref{thm:upper-bd},
in the case of polynomial eigendecay,
our regret bound improves on that of \citet{dai2023quantum} even under the unbiasedness assumption on the QMC estimator.
}

\subsubsection{Tradeoff Parameter}
Both our algorithm (Algorithm \ref{alg:qmc-kernel-ucb}) and Q-GP-UCB \citep{dai2023quantum} 
are UCB-type algorithms that extend QLinUCB \citep{wan2023quantum} to the kernelized case.
\strevision{
As discussed in \Cref{sec:tradeoff-parameter},
we introduce a novel parameter $\eta$ that trades off the total number of stages
against the regret incurred in each stage,
which yields the aforementioned improved regret bounds.
}

\subsubsection{Proof Technique for Bounding the (Weighted) Information Gain}
\strevision{
We note that \citet[Theorem~3]{vakili2021information} can be derived from \Cref{prop:log-det-ineq}
by an argument similar to the derivation of \Cref{cor:gamma-bound}.
In particular, the proof of \Cref{prop:log-det-ineq} and the analysis in \Cref{subsec:sketch-proof}
provide a simple alternative proof of \citet[Theorem~3]{vakili2021information}.
Although \citet{dai2023quantum} proved the same result as \Cref{cor:gamma-bound},
\Cref{prop:log-det-ineq} is more general, since
both \citet[Theorem~3]{vakili2021information} and \Cref{cor:gamma-bound} are corollaries of this proposition.
}

% Since we call the reward oracles $O(\frac{1}{\eta\epsilon_s})$ times, if $\eta$ is larger, 
% then regret incurred in each stage will be smaller,
% but the total number of stages will be larger.
% We have detailed the dependence of the parameter $\eta$ on the cumulative regret in \Cref{prop:regret-using-m}.

\subsection{Related Work}
\begin{comment}
\textcolor{magenta}{In the classical setting,}
\cite{valko2013finite} discussed kernelized upper confidence bound algorithm in an RKHS setting, 
where the function space that the reward function belongs to is possibly infinite dimensional. 
To overcome this difficulty, they introduced \emph{effective dimension} and derive cumulative regret bound.
% By using it they gave cumulative regret scales. 
The effective dimension of an RKHS is essential same as the information gain \citep[Remark 1]{vakili2021information}.
\cite{vakili2021information} derived general upper bounds of the information gain under conditions on the eigendecay of the kernel. 
\end{comment}
In the classical setting, \citet{valko2013finite} discussed a kernelized UCB algorithm as a non-linear extension of LinUCB \citep{li2010contextual}, and provided a cumulative regret bound based on the notion of effective dimension. The effective dimension of an RKHS is essentially the same as the information gain \citep[Remark~1]{vakili2021information}. \citet{vakili2021information} derived general upper bounds on the information gain under conditions on the eigendecay of the kernel.

As for prior work on bandit problems in the quantum setting, \citet{wan2023quantum} studied quantum multi-armed bandits and stochastic linear bandits with a linear reward model, and introduced quantum algorithms that enjoy a quadratic speedup over the best possible classical result. \citet{dai2023quantum} extended the work of \citet{wan2023quantum} to the case of a non-linear reward model and proposed a similar algorithm based on kernelization, under the unbiasedness assumption on the QMC estimator. Besides these studies, \citet{Li2022QuantumSpeedups} studied a quantum bandit convex optimization problem, and \citet{wang2021quantum} studied a best-arm identification problem in the quantum multi-armed bandit setting. The algorithms proposed in these studies are also stage-based, as in the present study, and have been shown to achieve a quantum speedup over classical algorithms. However, these algorithms are quite different from ours because the problem settings differ.

\begin{comment}
 \cite{Li2022QuantumSpeedups} 
 discussed the quantum algorithm for optimization of approximately convex functions and applied it  
 to zeroth-order stochastic convex bandits. They got the upper bound of regret $\Tilde{O}(n^5(\log(T))^2)$ and showed their algorithm achieved exponential speedup in $T$ compared to the classical lower bound $\Omega(n\sqrt{T})$. 
 \cite{Wu2023Quantumheavytailed} discussed 
  multi-armed bandits (MAB)
and stochastic linear bandits (SLB) with heavy tailed rewards and quantum reward oracle based on \citet{wan2023quantum}.
They gave regret bound $\tilde{O}\left(T^{\frac{1-v}{1+v}}\right)$ for some $v \in (0,1]$ , which is polynomially improving compared with the classical $\tilde{O}\left(T^{\frac{1}{1+v}}\right)$.
\cite{QuantumExplorationwn2021} studied quantum algorithm for identifying the best arm of a multi-armed bandit problem
and showed their algorithm achieved quadratic speedup compared to the classical setting. 
They showed $\tilde{O}\left(\sqrt{\sum_{i=1}^n\Delta_i^{-2}}\right)$ quantum queries. 
This means  $\tilde{O}\left(\sum_{i=1}^n\Delta_i^{-2}\right)$.
We note that $\Delta_i$ is the difference between
the mean reward of the best arm and the $i$th best arm. One quantum query means the one application of quantum oracle and Classical query means choosing one sample from one of the aims.
\end{comment}