Abstract: We consider an extension to the restless multi-armed
bandit (RMAB) problem with unknown arm dynamics, where an
unknown exogenous global Markov process governs the reward
distribution of each arm. Under each global state, the reward
process of each arm evolves according to an unknown Markovian
rule that differs across arms. At each time, a player chooses
one of N arms to play and receives a random reward drawn from a
finite set of reward states. The arms are
restless, that is, their local state evolves regardless of the player’s
actions. Motivated by recent studies on related RMAB settings, the
regret is defined as the reward loss with respect to a player that
knows the dynamics of the problem and plays at each time t the
arm that maximizes the expected immediate value (a sketch of this
definition follows the abstract). The objective is to develop an
arm-selection policy that minimizes the regret.
To that end, we develop the Learning under Exogenous Markov
Process (LEMP) algorithm. We analyze LEMP theoretically and
establish a finite-sample bound on the regret. We show that LEMP
achieves a logarithmic regret order with time. We further analyze
LEMP numerically and present simulation results that support
the theoretical findings and demonstrate that LEMP significantly
outperforms alternative algorithms.
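As an illustration only, the regret described above can be written out as follows; the notation (horizon T, played arm a(t), genie arm a*(t), per-arm rewards r_i(t)) is assumed here and is not taken from the abstract.

% A minimal LaTeX sketch of the regret notion from the abstract.
% All symbols (T, a(t), a^{*}(t), r_i(t), N) are assumed notation.
\[
  R(T) \;=\; \sum_{t=1}^{T} \mathbb{E}\!\left[ r_{a^{*}(t)}(t) - r_{a(t)}(t) \right],
  \qquad
  a^{*}(t) \;=\; \arg\max_{i \in \{1,\dots,N\}} \mathbb{E}\!\left[ r_i(t) \right],
\]
where a(t) is the arm selected by the learning policy at time t, r_i(t) is the random reward of arm i at time t, and a^{*}(t) is the arm that a player with full knowledge of the dynamics would play to maximize the expected immediate value.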