Keywords: h-likelihood, random effect model, reinforcement learning, stochastic multi-armed bandit
TL;DR: We propose a random effect upper confidence bound algorithm, based on the h-likelihood procedure, for stochastic multi-armed bandits with arm-dependent noise variances.
Abstract: The stochastic multi-armed bandit (SMAB) is a fundamental framework for sequential decision-making in reinforcement learning, where an agent must balance exploration and exploitation to maximize cumulative reward. Recently, a random effect SMAB has been proposed, in which the reward feedback is modeled as a random effect. However, this model has not yet been well formulated from a likelihood perspective, and individual noise variances can be arm-dependent. We propose a novel random effect upper confidence bound (ReUCBHL) algorithm based on the h-likelihood. The likelihood approach is conceptually simple and can be implemented by minimizing a loss, the negative h-likelihood. The algorithm applies to SMABs with univariate and multivariate rewards under arm-dependent noise variances, and it extends further to contextual multivariate bandits. Theoretical justification and simulation studies demonstrate that ReUCBHL consistently achieves better regret performance than baseline algorithms. These results highlight the effectiveness of the proposed approach.
Primary Area: reinforcement learning
Submission Number: 10585
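To make the abstract's claim concrete, below is a minimal, hypothetical sketch of a random-effect UCB index, not the authors' ReUCBHL implementation. It assumes a Gaussian random effect model r_{a,t} = mu + v_a + eps_{a,t} with v_a ~ N(0, tau2) and arm-dependent noise eps_{a,t} ~ N(0, sigma2[a]); in this Gaussian case, maximizing the h-likelihood in v_a yields the familiar shrinkage predictor used here. The names reucb_hl, tau2, sigma2, and the exploration constant c are illustrative choices, and the variance components are treated as known for brevity rather than estimated from the negative h-likelihood.

```python
import numpy as np

# Sketch only: a random-effect UCB under known variance components.
# Rewards follow r_{a,t} = mu + v_a + eps_{a,t}, with v_a ~ N(0, tau2)
# and arm-dependent noise eps_{a,t} ~ N(0, sigma2[a]).  For Gaussian
# models, maximizing the h-likelihood in v_a gives the shrinkage
# predictor computed below (the BLUP of mu + v_a).

def reucb_hl(pull, K, T, tau2=1.0, sigma2=None, c=2.0, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    sigma2 = np.ones(K) if sigma2 is None else np.asarray(sigma2, float)
    n = np.zeros(K)   # pull counts per arm
    s = np.zeros(K)   # reward sums per arm
    rewards = []
    for t in range(T):
        if t < K:                          # pull each arm once to initialize
            a = t
        else:
            ybar = s / n
            mu = ybar.mean()               # crude plug-in for the grand mean
            shrink = tau2 / (tau2 + sigma2 / n)        # h-likelihood shrinkage
            pred = mu + shrink * (ybar - mu)           # predicted arm means
            post_var = 1.0 / (1.0 / tau2 + n / sigma2) # prediction variance
            a = int(np.argmax(pred + c * np.sqrt(post_var)))  # UCB index
        r = pull(a)
        n[a] += 1
        s[a] += r
        rewards.append(r)
    return np.array(rewards)

# Toy usage with hypothetical arm means and heteroscedastic noise.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    means = np.array([0.1, 0.3, 0.5])
    noise = np.array([0.2, 1.0, 0.5])      # arm-dependent noise variances
    out = reucb_hl(lambda a: rng.normal(means[a], np.sqrt(noise[a])),
                   K=3, T=2000, sigma2=noise, rng=rng)
    print("average reward:", out.mean())
```

The shrinkage factor tau2 / (tau2 + sigma2[a] / n[a]) pulls noisy, under-sampled arms toward the grand mean, which is the qualitative behavior one expects from a random effect formulation; the paper's actual estimator, variance estimation, and regret analysis may differ.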