Restless and Uncertain: Robust Policies for Restless Bandits via Deep Multi-Agent Reinforcement Learning

Jackson A. Killian; Lily Xu; Arpita Biswas; Milind Tambe

Restless and Uncertain: Robust Policies for Restless Bandits via Deep Multi-Agent Reinforcement Learning

Jackson A. Killian, Lily Xu, Arpita Biswas, Milind Tambe

Published: 20 May 2022, Last Modified: 05 May 2023UAI 2022 PosterReaders: Everyone

Keywords: restless bandits, robustness, minimax regret, deep reinforcement learning

TL;DR: We introduce robustness to the popular restless multi-armed bandits problem, and solve the new, combinatorially challenging robust problem via a double oracle approach enabled by novel and generally interesting deep reinforcement learning methods.

Abstract: We introduce robustness in \textit{restless multi-armed bandits} (RMABs), a popular model for constrained resource allocation among independent stochastic processes (arms). Nearly all RMAB techniques assume stochastic dynamics are precisely known. However, in many real-world settings, dynamics are estimated with significant \textit{uncertainty}, e.g., via historical data, which can lead to bad outcomes if ignored. To address this, we develop an algorithm to compute minimax regret--robust policies for RMABs. Our approach uses a double oracle framework (oracles for \textit{agent} and \textit{nature}), which is often used for single-process robust planning but requires significant new techniques to accommodate the combinatorial nature of RMABs. Specifically, we design a deep reinforcement learning (RL) algorithm, DDLPO, which tackles the combinatorial challenge by learning an auxiliary ``$\lambda$-network'' in tandem with policy networks per arm, greatly reducing sample complexity, with guarantees on convergence. DDLPO, of general interest, implements our reward-maximizing agent oracle. We then tackle the challenging regret-maximizing nature oracle, a non-stationary RL challenge, by formulating it as a multi-agent RL problem between a policy optimizer and adversarial nature. This formulation is of general interest---we solve it for RMABs by creating a multi-agent extension of DDLPO with a shared critic. We show our approaches work well in three experimental domains.

Supplementary Material: zip

4 Replies

Loading