Abstract: In many practical sequential decision-making scenarios, the task is to choose a set of options rather than a single one. Although sequential decision-making problems have been studied in the multi-armed bandit setting, much of the related literature deals with the simplest case, in which the agent chooses a single arm at each time step. The variant in which the agent must choose a set of arms is called a combinatorial multi-armed bandit. The main aim of this paper is to study risk-aware algorithms for these problems. We consider such a problem with stochastic rewards and semi-bandit feedback, and for the two cases of Gaussian and bounded arm rewards we propose algorithms that maximize the Conditional Value-at-Risk (CVaR), a risk measure that takes into account the worst-case rewards achieved by the agent. We further analyze these algorithms and provide regret bounds. We believe that our results provide the first theoretical insights into combinatorial semi-bandit problems in the risk-aware case. Numerical experiments corroborate our theoretical findings.