Abstract: In this paper we consider Multi-Armed Gambler Bandits (MAGB), a stochastic process in which an agent performs successive actions and either loses 1 unit from its budget after observing a failure, or earns 1 unit after a success. It constitutes a survival problem in which the risk of ruin must be taken into account. The agent's budget evolves over time with the received rewards and must remain positive throughout the process. The contribution of this paper is the definition of an original heuristic that aims at improving the probability of survival in a MAGB by using the current budget, instead of time, as the factor that regulates exploration in UCB-like methods. The proposed strategy is then experimentally compared to standard algorithms, with good results.
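To make the core idea concrete, a minimal Python sketch of a budget-driven UCB index follows. The exact index formula, the function name `budget_ucb_sketch`, and the reward model details are illustrative assumptions, not the paper's definition; the only point shown is the substitution of the current budget for elapsed time in the exploration term.

```python
import math
import random

def budget_ucb_sketch(means, initial_budget, horizon):
    """Illustrative budget-driven UCB for a gambler bandit.

    `means` holds the (hidden) success probabilities of the arms;
    each pull yields +1 on success and -1 on failure, applied to
    the budget. The exploration bonus uses the current budget in
    place of time, mirroring the heuristic described above (the
    precise formula here is an assumption, not the paper's).
    """
    k = len(means)
    counts = [0] * k        # pulls per arm
    successes = [0] * k     # observed successes per arm
    budget = initial_budget

    for t in range(horizon):
        if budget <= 0:     # ruin: the process stops
            break
        if t < k:           # pull each arm once to initialize
            arm = t
        else:
            # UCB-like index with log(budget) replacing log(t)
            arm = max(
                range(k),
                key=lambda i: successes[i] / counts[i]
                + math.sqrt(2 * math.log(max(budget, 2)) / counts[i]),
            )
        win = random.random() < means[arm]
        counts[arm] += 1
        successes[arm] += win
        budget += 1 if win else -1
    return budget
```

Under this sketch, a shrinking budget dampens exploration exactly when ruin is near, which is the survival-oriented behavior the abstract describes.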