Keywords: Stochastic multi-armed bandit, Risk-sensitive regret, Hamilton-Jacobi-Bellman equation, Continuous time-limit
Abstract: We study a class of stochastic multi-armed bandit problems with a risk-sensitive regret measure in a continuous-limit setting. This problem is of interest when maximizing the expected reward is not the foremost objective and the problem horizon is long. By rescaling the state parameters, including the number of pulls and the cumulative reward of each arm, we study the bandit problem with infinite horizon and characterize the associated risk through a Hamilton-Jacobi-Bellman equation with quadratic growth. Using this approach, we establish an explicit form of the optimal policy associated with the considered risk. As an application, we present examples in which the results obtained in continuous time offer insights into the optimal policy for each case. Finally, we present numerical experiments that confirm the theoretical results.
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4560