In this paper, we proposed a UCB-type algorithm for quantum bandit problems where the reward function is non-linear with respect to an action.
By employing Mercer's theorem, we showed that the proposed algorithm achieves an $O(\text{poly}(\log T))$ regret bound when the eigenvalues of the Mercer operator decay exponentially fast.
A limitation of this study is that the proposed method calls a Quantum Monte Carlo method in each round, so its practical use must await the advent of fault-tolerant quantum computation.
As a direction for future research, it would be interesting to design an algorithm that does not require matrix inversion, for example along the lines of Langevin Monte Carlo Thompson Sampling (LMC-TS) \citep{xu2022langevin}, which is based on noisy gradient descent updates.
Moreover, the optimality of our algorithm remains unknown, so establishing a lower bound on the cumulative regret in this problem setting is an important open problem.