\begin{table*}[th]
    \centering
    \caption{
        % (\textcolor{red}{Fix me if there exist some mistakes: The result of \citet{wan2023quantum} is expected regret bound}) 
        Comparison of cumulative regret upper bounds for classical and quantum algorithms, where $\beta_p>1$ and $\beta_e>0$ are constants, $\delta\in(0,1)$ is the probability level, $T>0$ is the total number of rounds,
        and $\alpha > 0$ is any positive number.}
    \begin{tabular}{cccc}
    % {p{0.2\linewidth}p{0.2\linewidth}p{0.2\linewidth}p{0.2\linewidth}}
    \hline
    Reference & Kernel & Setting & Regret bound \\
    \hline \hline
    \multirow{2}{*}{\citet{vakili2021information}} & $\beta_p$ polynomial eigendecay & Classical & 
    $O \left(T^{\frac{\beta_p+1}{2\beta_p}}\log ^{1-\frac{1}{2\beta_p}}(T)\right)$ \\
     & $\beta_e$ exponential eigendecay& Classical & 
    $O\left(T^{\frac{1}{2}}\log^{1+\frac{1}{2\beta_e}}(T)\right)$ \\
    \citet{wan2023quantum} & $d$-dimensional linear & Quantum & $O\left(d^2\log^{5/2}(T)\right)$ \\
    \multirow{2}{*}{This paper (\Cref{thm:upper-bd})}  & $\beta_p$ polynomial eigendecay & Quantum & \cellcolor{black!10} 
    $\widetilde{O}\left( T^{\frac{3}{1 + \beta_p}} \log\left(\frac{1}{\delta} \right)\right)$ \\
     & $\beta_e$ exponential eigendecay & Quantum & \cellcolor{black!10} $\widetilde{O} \left( \log^{3(1 + \beta_e^{-1})/2} (T) \log\left(\frac{1 }{\delta} \right) \right)$ \\
    \hline
    \end{tabular}
    \label{tab:comp}
\end{table*}

% \textcolor{red}{
% (\emph{Abstract})
% We consider the quantum bandit problems where the underlying rewards are kernel functions rather than linear functions, and we focus on the case where the rewards are obtained through quantum reward oracles.
% We propose a UCB-type quantum algorithm and provide the cumulative regret bound under the conditions on the decay rate of eigenvalues of the kernel.
% When we have a polynomial eigendecay, the proposed algorithm attains $O(T^{4/\beta}\log^{2-2/\beta}(T)\log((T\log T)/\delta))$ cumulative regret bound with probability at least $1-\delta$.
% Whereas under the assumption of an exponential eigendecay, we can obtain the tighter bound on the cumulative regret as $O(\log^{2+2/\beta}(T)\log((\log T) / \delta))$ with probability at least $1-\delta$. 
% This study resolves the issue of the regret bound in the prior work which has a dependence on the dimension of the action set by considering the projection of the underlying Hilbert space onto the low dimensional finite subspace. (\emph{End of Abstract})
% % This study considers the projection underlying Hilbert space onto the finite-dimensional subspace, and we resolve the issue of the dependence on the dimension.
% % The theoretical results resolve the issue of the dependence on the dimension $O(d^2\log^{5/2} T)$ by considering the projection 
% % Our theoretical results resolve the issue of the dependence on the dimension by considering the projection of the underlying Hilbert space $\cH_k$ onto the finite-dimensional subspace of $\cH_k$.
% % This study is a non-linear generalization of prior work, but the results of theoretical analysis are non-trivial.
% % We consider the quantum bandit problems where the relationship between actions and mean rewards is non-linear, and we focus on the case where the reward is obtained through quantum reward oracles.
% % We propose a UCB-type quantum algorithm that attains $O(\text{poly}(\log T))$ regret bound under the condition that the decay rate of eigenvalues of the kernel is exponentially fast.
% % The algorithm proposed in this study is similar to prior work in the case where the action is linear with respect to the reward, and it appears that it can be simply extended non-linear case by kernelization.
% % However, the regret bound for the algorithm of prior work depends on the dimensionality of the action space, and we suffer from the dimensionality of the reproducing kernel Hilbert spaces (which is usually very large) when extending the existing method to a non-linear case.
% % We address the issue of dimensionality by considering a projection onto the smaller dimensional subspace of the Hilbert space associated with the kernelization.
% }

% \textcolor{red}{
Quantum machine learning is an emerging research field that attempts to enhance machine learning methods with quantum technology \citep{biamonte2017quantum,dunjko2018machine,schuld2018supervised,gyongyosi2019survey}.
The primary objective of quantum machine learning is to accelerate and improve the performance of classical machine learning algorithms using quantum computing paradigms and techniques. 
For instance, Grover's algorithm \citep{grover1996fast}, a well-known quantum algorithm for finding a unique item in an unstructured database of $N$ items, reduces the time complexity to $O(\sqrt{N})$, whereas the classical method requires $O(N)$ time.
% }

% \textcolor{red}{
The study of quantum algorithms for bandit problems has also attracted attention in the machine learning community, and there is much interest in the quantum acceleration of existing classical algorithms \citep{gyongyosi2019survey,biamonte2017quantum}.
Quantum algorithms for bandit problems have been studied in various settings, including best-arm identification \citep{casale2020quantum,wang2021quantum}, exploration-exploitation in stochastic environments \citep{wan2023quantum}, and adversarial environments \citep{cho2023quantum}.
% }

% \textcolor{red}{
% The bandit problem is a fundamental problem that models a sequential decision-making problem under an uncertain environment, in which the player repeatedly chooses an action to play and receives a loss for the played action as feedback in every round (\citep{lattimore2020bandit}.
% Bandit problems have been applied to a wide range of practical applications such as reinforcement (\citep{sutton1999reinforcement}), recommendation systems (\citep{li2010contextual}), portfolio selection in finance (\citep{shen2015portfolio}), and many other fields (\citep{gyorgy2007line,villar2015multi,mueller2019low}).
% }

% \textcolor{blue}{
Following \citet{wan2023quantum},
this paper focuses on a sequential decision-making problem called the \textit{quantum bandit problem}. 
Given a fixed set of actions $\mathcal{X}$,
in each round $t=1,2,\dots,T$, the player chooses an action $x_t\in\mathcal{X}$.
The objective of the player is to maximize the cumulative reward $\sum_{t=1}^{T} \mu(x_t)$, and performance is measured by the cumulative regret over $T$ rounds, defined as
\begin{math}
    R(T) = \sum_{t=1}^{T} \left( \mu(x^\star) - \mu(x_t) \right),
\end{math}
where $\mu: \mathcal{X} \rightarrow [0, 1]$ is the mean reward function, and $x^\star\in \argmax_{x\in \mathcal{X}} \mu(x)$ is the best action in hindsight.
During the game, the player can access a unitary operator (quantum circuit) $\mathcal{O}_{x}$ or its adjoint $\mathcal{O}^{\dag}_{x}$, referred to as a \emph{quantum reward oracle}, that encodes the reward distribution associated with the action $x$.
By invoking the quantum reward oracle and performing a measurement, the player obtains information about the reward, but the total number of oracle queries is limited to $T$.
Although any classical bandit algorithm can be applied to this setting,
an algorithm designed specifically for it can exploit quantum algorithms and may perform much better.
We review the details of this setting in Sec.~\ref{subsec:reward-function}.
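
As a simple illustration of the cumulative regret defined above (the gap $\Delta$ is introduced here only for illustration and is not used in the analysis), suppose that the player repeats a single action $x$ with $\Delta = \mu(x^\star) - \mu(x) > 0$ in every round. Then
\[
    R(T) = \sum_{t=1}^{T} \left( \mu(x^\star) - \mu(x) \right) = \Delta T,
\]
which grows linearly in $T$, whereas always playing $x^\star$ gives $R(T) = 0$.
A regret bound is therefore informative only when it is sublinear, $R(T) = o(T)$, so that the average regret $R(T)/T$ vanishes as $T \to \infty$.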
% Let $\mu:\mathcal{X}\to [0,1]$ be the mean reward function, and the objective of the player is to maximize the cumulative reward $\sum_{t=1}^{T} \mu(x_t)$, and the performance is measured in terms of the cumulative regret over $T$ rounds, which is defined as
% \begin{math}
%     R(T) = \sum_{t=1}^{T} \left( \mu(x^\star) - \mu(x_t) \right),
% \end{math}
% where $x^\star\in \argmax_{x\in \mathcal{X}} \mu(x)$ is the best action in hindsight.
% }

% \textcolor{red}{
% This paper focuses on the stochastic bandit problem, in which the player is given an action set $\mathcal{X}\subseteq\mathbb{R}^d$ and the total number $T$ of rounds.
% In each round $t=1,2,\dots,T$, the player chooses an action $x_t\in A$, and then the player observes  realized reward $\mu(x_t)$, where $\mu_t:\mathcal{X} \to [0,1]$ is a reward function.
% The objective of the player is to maximize the cumulative reward $\sum_{t=1}^{T} \mu_t(x_t)$, and the performance is measured in terms of the cumulative regret over $T$ rounds, which is defined as
% \begin{align}
%     R(T) = \sum_{t=1}^{T} \left( \mu(x^\star) - \mu(x_t) \right),
% \end{align}
% where $x^\star\in \arg\max_{x\in \mathcal{X}} \mu(x^\ast)$ is the best action in hindsight.
% }

% , say, $\mu(x) = x^\top\theta$, where $\theta\in\mathbb{R}^d$ is the unknown parameter.
\citet{wan2023quantum} studied the case where the reward function $\mu(x)$ is linear with respect to an action $x$.
By adapting the quantum Monte Carlo (QMC) method \citep{montanaro2015quantum},
they proposed an algorithm called QLinUCB that attains an $O(\text{poly}(\log T))$ regret bound.
% In \citet{wan2023quantum}, the quantum algorithm, referred to as QLinUCB, that attains $O(\text{poly}(\log T))$ regret bound was proposed.
% To achieve the quantum acceleration, \citet{wan2023quantum} adopted the Quantum Monte Carlo \citep{montanaro2015quantum} (QMC) method to explore the actions as a subroutine of the algorithm.
% Although the algorithm QLinUCB enjoys $O(\text{poly}(\log T))$ regret bound, however, a reward in many practical applications is usually nonlinear with respect to the action, which makes an obstacle to employ the algorithm to the practical application directly.
% To overcome the above issue, 
We extend the work of \citet{wan2023quantum} to the case where the mean reward function belongs
to a reproducing kernel Hilbert space (RKHS) associated with a kernel $k$.
In the classical setting, the kernelized bandit problem is also known as Bayesian optimization.
In the literature on the classical bandit problem, 
researchers often consider an extension of linear bandits to the kernelized case.
This is possible since a confidence interval for the mean reward estimate is known in the kernelized case \citep{srinivas2010gaussian,chowdhury2017kernelized},
and it can be used to derive a regret bound in the same manner as in the linear case.
However, in the quantum kernelized bandit problem,
neither confidence intervals nor the theoretical properties of algorithms designed for this setting have been well studied.

% a similar result about a confidence interval is not studied for the quantum case, and the result of \citet{wan2023quantum} cannot be directly applied to the kernelized case.