





Restless multi-armed bandits (RMABs), a model for constrained resource allocation among $N$ independent stochastic processes (arms), are widely studied. RMABs are traditionally a \textit{binary-action} problem, in which a planner decides whether or not to act on each of $N$ arms; here we consider the \textit{multi-action-type} generalization \citep{killian2021multiAction,glazebrook2011general} for its enhanced ability to capture challenging real-world planning problems. Salient examples of RMABs include scheduling \citep{bagheri2015restless,yang2018optimal}, machine replacement \citep{glazebrook2006some,ruiz2020multi},
aerial vehicle routing \citep{le2008multi},
anti-poaching patrol planning \citep{qian2016restless}, and healthcare \citep{lee2019optimal,mate2020collapsing}.
% \citep{biswas2021learn}
While these works have established important theoretical foundations, they share one key limitation: assuming stochastic dynamics are precisely known. 
% few works have reached real-world deployment, due largely to one key limitation: RMABs assume model parameters are known. 
However, exact knowledge of the dynamics is unattainable in many real-world problems. For example, in healthcare intervention planning, the probability that a patient will adhere to treatment after receiving an intervention is not perfectly known a priori; in anti-poaching patrol planning, the probability of finding a poacher's snare at a given location is not known with certainty.

Accordingly, methods have been developed to learn RMAB policies \textit{online}, assuming no a priori knowledge \citep{jung2019thompson,wang2020restless}. However, these methods require tens of thousands of samples to converge to good policies, which is prohibitive for many real-world problems, e.g., finite-length treatment settings such as tuberculosis \citep{mate2020collapsing}, where there are only hundreds of rounds. Instead, real-world planners must make the most of the noisy data at hand, estimating dynamics from historical data or consulting experts, which induces significant \emph{uncertainty}. Although RMAB techniques can be used to plan with point estimates, ignoring uncertainty can lead to arbitrarily bad policies.

% Accordingly, some techniques have been developed for learning RMAB policies \textit{online}---i.e., assuming no a priori knowledge---via Q-learning \citep{biswas2021learn,killian2021Q} or model-based methods \citep{jung2019thompson,wang2020restless}. However, these require tens of thousands of samples to converge to good policies which is prohibitive for many real-world problems, e.g., in finite-length treatment settings such as tuberculosis \citep{mate2020collapsing} or where dynamics change over time---making old samples obsolete---such as epidemic control \citep{bastani2021efficient}. Instead, real-world planners make the most of noisy data at hand by estimating dynamics from historical data or consulting experts, inducing significant \emph{uncertainty}. Although RMAB techniques can be used to plan with point estimates, ignoring uncertainty can lead to arbitrarily bad policies when measured on regret over the uncertainty space.

To address these shortcomings and push RMABs toward wider real-world applicability, we introduce \emph{Robust RMABs}, a generalization of RMABs that allows stochastic dynamics to be specified as uncertainty intervals rather than point estimates. This new problem is computationally demanding, adding a combinatorial layer of complexity onto an already PSPACE-hard problem \citep{papadimitriou1994complexity}. Addressing this complexity gives rise to a rich set of challenges that necessitate the design of new techniques, which not only help solve the robust objective we analyze but are also of general interest to RMAB research.

Concretely, we plan under a \emph{minimax regret} objective, using a double oracle (DO) framework \citep{mcmahan2003planning} that has seen success in problems involving a \emph{single} Markov decision process (MDP) \citep{xu2021robust}. The DO approach casts the robust planning problem as a zero-sum game between an \emph{agent} oracle and an adversarial \emph{nature} oracle. However, existing techniques fail for RMABs of any non-trivial size, since the state and action spaces grow combinatorially in the number of arms $N$ and the resource constraint $B$, respectively.
Specifically, given an $S$-sized state space for each arm, the full combinatorial problem has a state space of size $S^N$ and an action space, and thus a policy-network output, of size $\binom{N}{B}$ (for binary-action RMABs; the action space is larger for multi-action RMABs). At this size, we found that the approach of \citet{xu2021robust}, which tries to solve the full combinatorial problem as a single process, failed to learn good policies for RMABs as small as $N=5$ arms, with $B=3$ and $S=2$. Moreover, under the minimax regret objective, the nature oracle poses a particularly difficult challenge, since it requires jointly searching the RMAB policy space and the continuous space of uncertain transition probabilities. Previously, this has been posed as a non-stationary RL problem and solved heuristically with a single policy network \citep{xu2021robust}.
% , but such techniques are known to suffer from convergence issues in general. 
We improve on this by reformulating the nature oracle as a multi-agent RL problem and developing a multi-agent RL method to solve it for RMABs. In summary, our contributions are as follows:

% the double oracle framework requires two oracle algorithms which can be efficiently queried for \emph{best responses} --- for Robust RMABs, the \emph{agent oracle's} best response is a traditional RMAB policy and the \emph{nature oracle's} best response is an adversarial selection of model parameters within the uncertainty intervals. As we will demonstrate, no such efficiently queryable algorithms exist for the RMAB agent oracle, and no algorithms exist at all for the RMAB nature oracle. Thus, the design of these oracles will be two of our main contributions, which in sum are as follows:

\begin{enumerate}[leftmargin = *]
  \setlength\itemsep{0em}
\item We introduce the Robust RMAB problem with interval uncertainty over arm dynamics and develop techniques to solve a minimax regret objective via a DO approach.
\item To enable the DO approach, we introduce DDLPO, a novel deep RL algorithm for RMABs that is of general interest. DDLPO tackles the combinatorial complexity of RMABs by learning an auxiliary ``$\lambda$-network'' in tandem with individual arm policy networks, which greatly reduces training sample complexity. The procedure implements the reward-maximizing agent oracle, has convergence guarantees, and solves RMABs with multiple action types \citep{killian2021multiAction,glazebrook2011general}, making it the first deep RL procedure to do so. DDLPO also extends easily to more general weakly-coupled MDPs \citep{adelman2008relaxations,hawkins2003langrangian} and enables computing continuous-action policies, a previously unstudied RMAB direction.
\item We formulate the non-stationary, regret-maximizing nature oracle as a multi-agent RL (MARL) problem, a framework of potential general interest in robust planning. We solve it for the combinatorially hard RMAB setting by extending DDLPO to include a shared critic and a continuous-action policy network for nature's selection of the uncertain transition dynamics.
\end{enumerate}
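
To make the combinatorial growth discussed above concrete, the following worked arithmetic evaluates the joint state-space size $S^N$ and the joint action-space size $\binom{N}{B}$ in the binary-action case. The first line uses the $N=5$, $B=3$, $S=2$ setting mentioned above; the second line uses a hypothetical larger instance ($N=50$, $B=10$, $S=2$) chosen here only for illustration and not taken from our experiments.
\[
S^N = 2^{5} = 32, \qquad \binom{N}{B} = \binom{5}{3} = 10,
\]
\[
S^N = 2^{50} \approx 1.1 \times 10^{15}, \qquad \binom{N}{B} = \binom{50}{10} = 10{,}272{,}278{,}170 \approx 1.0 \times 10^{10}.
\]
Even at this moderate illustrative scale, treating the RMAB as a single joint process would require a policy network with roughly $10^{10}$ outputs, which is one way to see why a decomposition across arms is attractive.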
