\documentclass{uai2022} % for initial submission
% \documentclass[accepted]{uai2022} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2022} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2022} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

%packages from NeurIPS paper
\usepackage[utf8]{inputenc} % allow utf-8 input
\usepackage[T1]{fontenc}    % use 8-bit T1 fonts
\usepackage{hyperref}       % hyperlinks
\usepackage{url}            % simple URL typesetting
\def\UrlBreaks{\do\/\do-}
\usepackage{amsfonts}       % blackboard math symbols
\usepackage{nicefrac}       % compact symbols for 1/2, etc.
\usepackage{microtype}      % microtypography
\usepackage{xcolor}         % colors
\usepackage{amsthm}
\usepackage{amsmath}
\usepackage{bm}
% \usepackage{subfig}
\usepackage{graphicx}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{thmtools}
\usepackage{thm-restate}
\usepackage{enumitem}
\usepackage{algorithm}
\usepackage[noend]{algorithmic}


\newtheorem{theorem}{Theorem}
\newtheorem{corollary}{Corollary}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{claim}{Claim}
\newtheorem{example}[theorem]{Example}
\newtheorem{cond}{Condition}
\newtheorem{remark}{Remark}
\newtheorem{proposition}{Proposition}
\theoremstyle{definition}
\newtheorem{definition}{Definition}%[section]


\newcommand\jk[1]{\textcolor{blue}{#1}}
\newcommand\lx[1]{\textcolor{olive}{[Lily: #1]}}




%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

% \title{Your Bandit Model is Not Perfect: Introducing Robustness to Restless Bandits Enabled by Deep Reinforcement Learning}
\title{Restless and Uncertain: Robust Policies for Restless Bandits \\ via Deep Multi-Agent Reinforcement Learning}

% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<jj@example.edu>?Subject=Your UAI 2022 paper}{Jane~J.~von~O'L\'opez}{}}
\author[1]{Harry~Q.~Bovik}
\author[1,2]{Further~Coauthor}
\author[3]{Further~Coauthor}
\author[1]{Further~Coauthor}
\author[3]{Further~Coauthor}
\author[3,1]{Further~Coauthor}
% Add affiliations after the authors
\affil[1]{%
    Computer Science Dept.\\
    Cranberry University\\
    Pittsburgh, Pennsylvania, USA
}
\affil[2]{%
    Second Affiliation\\
    Address\\
    …
}
\affil[3]{%
    Another Affiliation\\
    Address\\
    …
  }
  
  \begin{document}
\maketitle

\begin{abstract}
We introduce robustness in \textit{restless multi-armed bandits} (RMABs), a popular model for constrained resource allocation among independent stochastic processes (arms). Nearly all RMAB techniques assume stochastic dynamics are precisely known. However, in many real-world settings, dynamics are estimated with significant \emph{uncertainty}, e.g., via historical data, which can lead to bad outcomes if ignored. To address this, we develop an algorithm to compute minimax regret--robust policies for RMABs.
Our approach uses a double oracle framework (oracles for \textit{agent} and \textit{nature}), which is often used for single-process robust planning but requires significant new techniques to accommodate the combinatorial nature of RMABs. Specifically, we design a deep reinforcement learning (RL) algorithm, DDLPO, which tackles the combinatorial challenge by learning an auxiliary ``$\lambda$-network'' in tandem with policy networks per arm, greatly reducing sample complexity, with guarantees on convergence. DDLPO, of general interest, implements our reward-maximizing agent oracle. We then tackle the challenging regret-maximizing nature oracle, a non-stationary RL challenge, by formulating it as a multi-agent RL problem between a policy optimizer and adversarial nature. 
This formulation is of general interest---we solve it for RMABs by creating a multi-agent extension of DDLPO with a shared critic. We show our approaches work well in three experimental domains.
% Restless multi-arm bandits (RMABs) are receiving renewed attention for their potential to model real-world planning problems under resource constraints. However, few RMAB models have surpassed theoretical interest, since they make the limiting assumption that model parameters are perfectly known. In the real world, model parameters often must be estimated via historical data or expert input, introducing uncertainty. In this light, we introduce a new paradigm, \emph{Robust RMABs}, a challenging generalization of RMABs that incorporates interval uncertainty over parameters of the dynamic model of each arm. This uncovers several new challenges for RMABs and inspires new algorithmic techniques of general interest. Our contributions are:
% (i)~We introduce the Robust Restless Bandit problem with interval uncertainty and solve a minimax regret objective;
% (ii)~We tackle the complexity of the robust objective via a double oracle (DO) approach and analyze its convergence;
% (iii)~To enable our DO approach, we introduce DDLPO, a novel deep reinforcement learning (RL) algorithm for solving RMABs, of potential general interest. 
% % DDLPO learns an auxiliary ``$\lambda$-network'' in tandem with individual arm networks to reduce sample complexity while guaranteeing convergence. 
% The procedure also generalizes to continuous-action settings, the first algorithm of its kind for RMAB, as well as multi-action settings, the first deep RL algorithm to do so;
% (iv)~We design the first adversary algorithm for  RMABs, required to implement the notoriously difficult minimax regret adversary oracle and also of general interest, by formulating it as a multi-agent RL problem and solving with a multi-agent extension of DDLPO.
\end{abstract}

\section{Introduction}\label{sec:intro}


\input{2_intro}

\section{Related Work}
\label{sec:related_work}
%Killian et al.~\cite{killian2021multiAction} proposed a method that leverages the convexity of an approximate Lagrangian version of the multi-action RMAB problem. 


%The \textit{restless multi-armed bandit} (RMAB) problem was introduced by Whittle~\cite{whittle1988restless} where he showed that a relaxed version of RMAB problem can be solved optimally using a heuristic called \textit{Whittle Index policy}. This policy is shown to be optimal when the RMAB instances satisfy \textit{indexability} property. Moreover, Papadimitriou and Tsitsiklis~\cite{papadimitriou1994complexity} established that solving RMAB is PSPACE-hard, even for the special case when the transition rules are deterministic. 

\paragraph{RMAB} The reward-maximizing, binary-action RMAB problem was introduced by \citet{whittle1988restless}. His widely used Whittle index policy \citep{mate2020collapsing,glazebrook2006some,bagheri2015restless} is asymptotically optimal under \textit{indexability} \citep{weber1990index}. \citet{glazebrook2011general} and \citet{hodge2015asymptotic} extended the Whittle index to multi-action RMABs with special monotonic structure, while \citet{killian2021multiAction} gave a more general Lagrange-based method. \citet{hawkins2003langrangian} studied methods for weakly coupled Markov decision processes (WCMDPs), which generalize multi-action RMABs to have multiple constraints, proposing Lagrangian solutions for small problems. \citet{adelman2008relaxations} and \citet{gocgun2012lagrangian} subsequently provided better solutions to WCMDPs at the cost of scalability. All of these works assume precise knowledge of the stochastic dynamics. Some recent works have studied online RMABs with unknown dynamics, but all have prohibitively large sample complexity~\citep{gafni2020learning,jung2019regret,biswas2021learn,killian2021Q}. None consider robust planning under environment uncertainty, which we address.

\paragraph{RL for RMAB} A few recent works learn Whittle indexes for indexable binary-action RMABs using (i)~deep RL (DRL) \citep{nakhleh2020neurwin} and (ii)~tabular Q-learning~\citep{biswas2021learn,fu2019towards,avrachenkov2020whittle}. \citet{killian2021Q} take tabular Q-learning to the multi-action setting. In contrast, our DRL approach provides a more general solution to binary and multi-action RMAB domains, not requiring indexability or problem structure, and is far more scalable than tabular methods. We are also the first to handle continuous-action RMABs, which is key to the nature oracle. Also related is the space of combinatorial RL. However, most existing algorithms consider single-shot problems, e.g., traveling salesman \citep{kool2018attention,khalil2017learning}, which lack a notion of future state that is critical to solving any version of RMAB, and none accommodate the general cost/budget structure of multi-action RMABs \citep{song2019solving}; our methods address these limitations.

\paragraph{Robust planning} Work on robust planning in RL mainly focuses on maximin reward via robust adversarial RL \citep{pinto2017robust} or multi-agent RL (MARL) \citep{lanctot2017unified,li2019robust}, but maximin reward leads to overly conservative policies \citep{nguyen2014regret}. The minimax regret criterion \citep{braziunas2007minimax} avoids this pitfall, but this objective is challenging with very large or continuous strategy spaces. This can be addressed with the double oracle (DO) approach proposed by \citet{mcmahan2003planning}, which explores a small subset of strategies while still guaranteeing optimal convergence \citep{gilbert2017double}. Subsequently, DO has been extended to optimize MARL problems with multiple selfish agents \citep{lanctot2017unified}. Recently, \citet{xu2021robust} used DO to solve a minimax-regret planning problem for a single Markov decision process (MDP) and used RL to design the oracles. However, when applied to RMABs, the number of outputs in their policy network grows exponentially, as does the size of the state space being learned, both of which require prohibitively long training times beyond trivially sized RMABs. Accordingly, we found that their RL algorithms failed to scale past $N=5$ arms and $S=2$ states, whereas we show in Section~\ref{sec:experiments} that our algorithms solve problems that are orders of magnitude larger. Additionally, their approach is designed only for continuous state/action spaces, whereas our approach is capable of finding robust policies for any combination of discrete/continuous state/action spaces. We accomplish this via our novel formulation of the nature oracle as a MARL problem, which decomposes the causes of non-stationarity, i.e., agent and nature, and learns them with separate networks.
% , and (2)~building off of the flexible PPO training procedure \citep{schulman2017proximal}.

% Robust planning has been identified as a critical concern in domains such as healthcare \cite{begoli2019need,ghassemi2019practical,wilder2017uncharted}, environmental conservation \cite{moilanen2006planning,regan2005robust,visconti2015building}, and urban planning \cite{shortridge2017robust,yao2009evacuation}, underlining the need to develop effective policies that are robust to uncertainty for these urgent real-world settings. 

% Robustness objectives have been considered for bandit applications in the two-action stochastic setting, where each pull of an arm draws a reward sampled from an unknown Bernoulli distribution. \citet{wei2021nonstationary} address minimax regret of time-varying reward shifts using heuristics to trade off remembering vs.\ forgetting, and \citet{garivier2016maximin} consider a maximin objective to guide exploration in Monte Carlo Tree Search. However, RMABs are significantly harder than the stochastic settings since, in RMABs, the rewards are dependent on the current state which in turn depends on the actions taken on the arms. 


% TODO: adversarial stochastic bandits - find one or two papers that talk about maxmin in stochastic bandits. which motivate why maxmin objectives are well-studied. but stochastic bandits are much simpler than restless bandits, so we're looking to generalize those results

\begin{figure*}[t]
\centering
\includegraphics[width=0.95\linewidth]{img/concept_fig.pdf} 
\caption{\textbf{(a)}~Proposed framework for solving the Robust RMAB problem. The main loop follows a DO approach to iteratively compute a minimax regret optimal RMAB policy where each oracle is a novel DRL algorithm for RMABs. 
% The set of agent RMAB policies map states to actions for the arms of the RMAB and the set of nature model parameter settings control the RMAB transition dynamics. Each loop $e$, we compute the optimal mixed strategy over agent and nature policies, then pass each mixed strategy to the opposing oracle which return best responses to add to their respective sets. Designing each oracle are two of our key contributions, requiring novel DRL algorithms for RMABs. 
\textbf{(b)}~The nature oracle: a novel multi-agent RL formulation of RMABs that tackles non-stationarity with a centralized critic.}
% In particular, we tackle the non-stationarity of the regret-maximizing nature oracle by formulating it as a MARL problem. $\pi^{(A)}$ is a ``helper'' which learns the optimal policy, and thus the maximum return, while $\pi^{(B)}$ learns environment parameters which maximize regret of the current (fixed) agent mixed strategy, and is returned by the oracle.} 
\label{fig:concept} 
\end{figure*} 


\section{Preliminaries}
\label{sec:preliminaries}

We consider the multi-action RMAB setting with $N$ arms \citep{killian2021multiAction,glazebrook2011general}, which generalizes classical binary-action RMABs \citep{whittle1988restless}.\footnote{Our approaches also easily extend to weakly coupled MDPs, which allow multiple budget constraints \citep{hawkins2003langrangian}, as well as to continuous-action RMABs, previously unstudied.} Each arm $n\in [N]$ follows an MDP $(\mathcal{S}_n, \mathcal{A}_n, \mathcal{C}_n, T_n, R_n, \beta)$, where $\mathcal{S}_n$ is a finite set of discrete states; $\mathcal{A}_n$ is a finite set of discrete actions; $\mathcal{C}_n : \mathcal{A}_n \xrightarrow[]{} \mathbb{R}$ defines action costs, where $\mathcal{C}_n[0] = 0$ encodes a no-cost ``passive action'' for all arms; $T_n: \mathcal{S}_n\times \mathcal{A}_n \times \mathcal{S}_n \xrightarrow[]{} [0,1]$ gives the probability of transitioning from one state to another given an action; $R_n:\mathcal{S}_n \xrightarrow[]{} \mathbb{R}$ is a reward function; and $\beta \in [0, 1)$ is the discount factor. For ease of exposition, let $\mathcal{S}_n, \mathcal{A}_n, \mathcal{C}_n,$ and $R_n$ be the same for all $n\in[N]$, and we thus drop the subscript $n$, though all methods apply to the general case. Let $\bm{s}$ be an $N$-length vector of states over all arms and let $\bm{A} \in \{0,1\}^{N\times |\mathcal{A}|}$ be a decision matrix that one-hot-encodes the action taken on each arm. The planner computes policies $\pi$, which map states $\bm{s}$ to actions $\bm{A}$ with the constraint that, in each of $H$ rounds, the total cost of actions is at most a budget $B$.

% Let The aim of multi-action RMABs is to maximize total reward over a fixed number of $H$ rounds, subject to this budget constraint. 

We extend multi-action RMABs to the robust setting in which the exact transition probabilities are unknown. Instead, the transition dynamics $T_n$ of each arm $n \in[N]$ are determined by a set of parameters $\omega_n \in \Omega_n$, each lying within a given uncertainty interval $\underline{\overline{\omega}}_n:=[\underline{\omega}_{n}, \overline{\omega}_{n}]$. Let $\omega$ be a given parameter setting such that $\omega_{n} \in \underline{\overline{\omega}}_n$ for all $n \in [N]$. Let $G(\pi,\omega) = \mathbb{E}[\sum_{t=0}^{H}\beta^t \sum_{n\in [N]}R(\bm{s}_n^t) \mid \pi, \omega]$ be the planner's expected discounted reward under $\pi$ and $\omega$, where $\bm{s}_n^t$ is the state of arm $n$ at time $t$. Then, \emph{regret} is defined as
\begin{align}
L(\pi,\omega) = G(\pi^\star_{\omega},\omega) - G(\pi,\omega) \ ,
\label{eq:regret}
\end{align}
where $\pi^\star_{\omega}$ is the optimal reward-maximizing policy under $\omega$. In our robust setting, our objective is to compute a policy~$\pi^{\dagger}$ that minimizes the maximum regret~$L$ possible for any realization of $\omega$, i.e.:
\begin{align}
    \pi^{\dagger} = \operatorname*{arg\,min}_{\pi}\max_{{\omega}}{L(\pi,{\omega})} \ .
    \label{eq:minimax}
\end{align}
This problem is computationally expensive to solve, since simply computing a policy~$\pi$ that maximizes the reward $G(\pi,\omega)$ is PSPACE-hard \citep{papadimitriou1994complexity} even when the $T_n$ are known, i.e., $\omega$ is given.
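
\paragraph{Illustrative example} To see how the minimax regret objective of Eq.~\ref{eq:minimax} can differ from a maximin reward objective, consider a purely hypothetical instance. The numbers below are assumptions chosen for illustration, not results. Suppose there are two parameter settings $\omega_1,\omega_2$ and, for simplicity, only two available policies $\pi_1,\pi_2$, so that $\pi^\star_{\omega}$ is the better of the two under each $\omega$:
\begin{align*}
    G(\pi_1,\omega_1) &= 10, & G(\pi_2,\omega_1) &= 6, \\
    G(\pi_1,\omega_2) &= 2,  & G(\pi_2,\omega_2) &= 3.
\end{align*}
Then $\pi^\star_{\omega_1}=\pi_1$ and $\pi^\star_{\omega_2}=\pi_2$, and Eq.~\ref{eq:regret} gives
\begin{align*}
    L(\pi_1,\omega_1) &= 0, & L(\pi_1,\omega_2) &= 3-2 = 1, \\
    L(\pi_2,\omega_1) &= 10-6 = 4, & L(\pi_2,\omega_2) &= 0.
\end{align*}
Hence $\max_{\omega}L(\pi_1,\omega)=1$ and $\max_{\omega}L(\pi_2,\omega)=4$, so minimax regret selects $\pi_1$. A maximin reward criterion would instead select $\pi_2$, since $\min_{\omega}G(\pi_2,\omega)=3 > 2=\min_{\omega}G(\pi_1,\omega)$, even though $\pi_2$ gives up four units of reward under $\omega_1$.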

% This challenge is likely a key reason why the robust formulation has not yet been addressed. 
%However, we handle this complexity via a Lagrangian relaxation of the relevant underlying objective of maximizing expected reward (detailed in Section \ref{sec:lagrangian_relaxation}). 
% To overcome the complexity of the minimax optimization, we take a double oracle approach \citep{mcmahan2003planning}, which requires key innovations to work in the RMAB setting. 

% The double oracle approach achieves the minimax regret objective in Eq.~\ref{eq:minimax} by casting the optimization problem as a zero-sum game between two players, the \textit{agent} and \textit{nature}, visualized in Fig.~\ref{fig:concept}(a). The agent selects an RMAB policy that minimizes regret for some realization of the model parameters. Nature then adversarially selects the values of $\mathcal{P}_n$ that maximize regret for a given policy of the RMAB planner. This framework is desirable since it converges to an $\varepsilon$--optimal solution \citep{adam2021double,xu2021robust}, assuming the oracles return the best response for both players. The key technical contributions of this paper arise from designing the agent and nature oracles (right boxes of Fig.~\ref{fig:concept}(a)). 

% For the agent, minimizing regret with respect to a fixed nature strategy is equivalent to maximizing reward w.r.t.\ that strategy, so the agent objective is the same as solving a multi-action RMAB to find the best policy. 
%bsection{Lagrangian Relaxation} 
% \label{sec:lagrangian_relaxation}
%Let $\pi^\star_{\omega}$ be the optimal policy for a given multi-action RMAB defined by $\hat{\omega}$ parameter. Formally, 
A more tractable approach for computing multi-action RMAB policies $\pi$ is to use the Lagrangian relaxation \citep{hawkins2003langrangian,killian2021multiAction}, reproduced below.
For a given $\omega$, the optimal policy $\pi^\star_{\omega}$ attains the maximum in the constrained Bellman equation:
\begin{equation}
\begin{aligned}\label{eq:combined_value_function}
    % J(\bm{s}) &= \max_{\bm{A}^c}\left\{\sum_{n=1}^{N} R(\bm{s}_n, \bm{A}^c_{nj}, \bm{s}_n^\prime) + \beta \mathbb{E}_{\omega}[J(\bm{s}^\prime) \mid \bm{s}, \bm{A}^c]\right\} \\
    J(\bm{s}) &= \max_{\bm{A}^c}\left\{\sum_{n=1}^{N} R(\bm{s}_n) + \beta \mathop{\mathbb{E}}_{\omega}[J(\bm{s}^\prime) \mid \bm{s}, \bm{A}^c]\right\} \\
    & \hspace{-6mm}\text{where } \bm{A}^c \subseteq \bm{A} \\ 
    \text{s.t. } &\sum_{n=1}^{N}\sum_{j=1}^{|\mathcal{A}|} \bm{A}_{nj}c_{j} \le B
    \qquad
    \sum_{j=1}^{|\mathcal{A}|} \bm{A}_{nj} = 1 \hspace{2mm} \forall n \in [N] 
\end{aligned}
\end{equation}
%However, this is an optimization problem with exponentially many states and actions, making it at least PSPACE-Hard to solve \citep{papadimitriou1994complexity}. 
where $\bm{A}_{nj} = 1$ if the $j^\text{th}$ action is taken on arm $n$ (else 0) and $c_{j} \in \mathcal{C}$ is the $j^\text{th}$ action cost. We then take the Lagrangian relaxation of the budget constraint~\citep{hawkins2003langrangian}, giving:
\begin{align}
    &J(\bm{s}, \lambda^\star) = \min_{\lambda} \left( \frac{\lambda B}{1-\beta} + \sum_{n=1}^{N}\max_{j\in[|\mathcal{A}|]}\{Q_n(\bm{s}_n, a_{nj}, \lambda)\} \right) \label{eq:decoupled_value_func} \\
    &\quad \text{where }\hspace{1mm} Q_n(\bm{s}_n, a_{nj}, \lambda) =
    R(\bm{s}_n) - \lambda c_{j} + \nonumber \\ 
    &\quad \qquad \beta \mathbb{E}_{\omega} \left[ Q_n(\bm{s}_n^{\prime}, a_{nj}, \lambda) \mid \pi^{La}_{\omega}(\lambda) \right] \ . \label{eq:arm_value_func_lagrange}
\end{align}
Here, $a_{nj}$ is the $j^\text{th}$ action of arm $n$, $Q$ is the state-action value function, and  $\pi^{La}_{\omega}(\lambda)$ is the optimal policy for a given $\lambda$. The key insight is that this relaxation decouples the value functions of the arms, except for the shared $\lambda$, i.e., for a given value of $\lambda$, all $Q_n$ could be solved via $N$ individual value iterations. However, finding and setting $\lambda:= \lambda^\star$ is critical to finding good policies for multi-action RMABs \citep{killian2021multiAction,glazebrook2011general}, where $\pi^{La}_{\omega}(\lambda^\star)$ is used to recover a policy that respects the original budget constraint by solving a knapsack with $Q_n(\bm{s}_n, a_{nj}, \lambda^\star)$ as values, $\mathcal{C}$ as weights, and the constraints of Eq.~\ref{eq:combined_value_function}, then taking the actions according to the $Q_n$ in the solved knapsack.
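
\paragraph{Sketch of the decoupling step} The step from Eq.~\ref{eq:combined_value_function} to Eq.~\ref{eq:decoupled_value_func} can be sketched as follows. The sketch assumes an infinite-horizon discounted objective and a multiplier $\lambda \ge 0$, and it uses the shorthand $V_n(\bm{s}_n,\lambda) := \max_{j} Q_n(\bm{s}_n,a_{nj},\lambda)$, which is introduced here only for exposition. Moving the budget constraint into the objective gives
\begin{align*}
    J(\bm{s},\lambda)
    &= \max_{\bm{A}} \Big\{ \sum_{n=1}^{N} R(\bm{s}_n) + \lambda\Big(B - \sum_{n=1}^{N}\sum_{j=1}^{|\mathcal{A}|}\bm{A}_{nj}c_j\Big) + \beta \mathop{\mathbb{E}}_{\omega}\big[J(\bm{s}^\prime,\lambda) \mid \bm{s}, \bm{A}\big] \Big\} \\
    &= \frac{\lambda B}{1-\beta} + \sum_{n=1}^{N} V_n(\bm{s}_n,\lambda) \ ,
\end{align*}
where the maximization is now subject only to the per-arm constraint $\sum_{j}\bm{A}_{nj}=1$. The first term collects $\lambda B\sum_{t\ge 0}\beta^t = \lambda B/(1-\beta)$, and the remainder separates across arms because each arm's reward, cost, and transition depend only on its own state and action:
\[
    V_n(\bm{s}_n,\lambda) = \max_{j}\Big\{ R(\bm{s}_n) - \lambda c_j + \beta \mathop{\mathbb{E}}_{\omega}\big[V_n(\bm{s}_n^\prime,\lambda) \mid \bm{s}_n, a_{nj}\big] \Big\} \ .
\]
For any $\lambda \ge 0$ and any policy that respects the budget, the added term $\lambda(B - \sum_{n}\sum_{j}\bm{A}_{nj}c_j)$ is nonnegative, so $J(\bm{s},\lambda)$ upper-bounds the constrained value $J(\bm{s})$. This is why Eq.~\ref{eq:decoupled_value_func} minimizes over $\lambda$ to obtain the tightest such bound.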
% and is asymptotically optimal in the binary-action case \citep{weber1990index}, i.e., $\pi^{La}_{\omega}(\lambda^\star) \xrightarrow[]{} \pi^\star_{\omega}$. Given this relationship, in the remainder of the paper, we focus on computing $\pi^{La}_{\omega}(\lambda^\star)$ and denote it as $\pi^\star_{\omega}$ for convenience. 

% Much effort has been invested in finding fast methods for computing $\pi^{La}_{\omega}(\lambda^\star)$ in the binary-action case \citep{glazebrook2006some,Sombabu2020,Liu2010} and recently in the multi-action case \citep{glazebrook2011general,hodge2015asymptotic}. \textbf{However, the best general method still relies on solving linear programs for each arm, which does not scale well to problems with very large state or action spaces. Therefore in this work, we will investigate methods that can solve Eq.~\ref{eq:decoupled_value_func} via deep reinforcement learning (RL), which has recently seen major success in finding optimal policies for large-scale MDPs. The key challenge to developing such RL techniques for Multi-Action RMABs will be in deriving a gradient update rule for $\lambda$ that allows us to converge to $\pi^{La}_{\omega}(\lambda^\star)$.}

\section{Solving Robust RMAB\lowercase{s}}
\label{sec:robust-rmab}

We now build our approach for finding robust RMAB policies, visualized in Fig.~\ref{fig:concept}(a). We use an iterative DO approach which achieves the minimax regret objective of Eq.~\ref{eq:minimax} by casting the optimization problem as a zero-sum game between two players, an \textit{agent} which learns policies $\pi$ to minimize regret, and an adversarial \textit{nature} which selects environment parameters $\omega$ to maximize regret of the agent. 
%The approach pitches the robustness problem as a two player zero-sum game with an \textit{agent} that aims to minimize regret and \textit{nature} that aims to maximize regret \citep{xu2021robust}. Double oracle is an iterative technique for finding an optimal solution \citep{mcmahan2003planning}, i.e., a mixed equilibrium strategy for both the agent and nature. 
In this two-player game, the \textit{pure strategy} space for the agent is the set of all feasible RMAB policies $\pi$ that respect the budget constraint. The pure strategy space for nature is a continuous, closed set of parameters $\omega$ within the given uncertainty intervals. The algorithm maintains a pure strategy set for the agent and nature (Fig.~\ref{fig:concept}(a) left boxes); each iteration, these strategy sets are used to compute a \textit{mixed strategy}---i.e., a probability distribution over pure strategies---Nash equilibrium in a regret game (Fig.~\ref{fig:concept}(a) center). Each oracle then learns a best response against the opponent's mixed strategy to add to its strategy set (Fig.~\ref{fig:concept}(a) right boxes). 
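
\paragraph{Restricted regret game (sketch)} As an illustration of the equilibrium step, assume for exposition that regret against a mixed strategy is the expectation of the pure-strategy regret. Suppose the current strategy sets are $\{\pi_1,\ldots,\pi_I\}$ for the agent and $\{\omega_1,\ldots,\omega_K\}$ for nature. The symbols $M$, $p$, $q$, $I$, and $K$ are notation introduced only for this sketch. With regret matrix $M_{ik} = L(\pi_i,\omega_k)$, an equilibrium pair of mixed strategies $(p^\star, q^\star)$ solves
\[
    \min_{p \in \Delta_I} \; \max_{q \in \Delta_K} \; p^\top M q \ ,
\]
where $\Delta_d$ denotes the probability simplex in $\mathbb{R}^d$. Because this restricted game is a finite zero-sum matrix game, it can be solved with a standard linear program. The distributions $p^\star$ and $q^\star$ then play the roles of the agent and nature mixed strategies that are handed to the opposing oracles.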

The agent oracle's goal is to find an RMAB policy $\pi$, or pure strategy, to minimize regret (Eq.~\ref{eq:regret}) given a nature \emph{mixed strategy} $\tilde{\omega}$. That is, the agent minimizes $L(\pi,\tilde{\omega})$ w.r.t.\ $\pi$, while $\tilde{\omega}$ is constant. Recall from Eq.~\ref{eq:regret} that
$L(\pi,\tilde{\omega}) = G(\pi^\star_{\tilde{\omega}},\tilde{\omega}) - G(\pi,\tilde{\omega})$.
Since $\tilde{\omega}$ and $\pi^\star_{\tilde{\omega}}$ are constant, the first term $G(\pi^\star_{\tilde{\omega}},\tilde{\omega})$ is also constant. Thus, minimizing $L(\pi,\tilde{\omega})$ is equivalent to maximizing the second term $G(\pi,\tilde{\omega})$, which is maximal at $\pi=\pi^\star_{\tilde{\omega}}$. In other words, the agent oracle must compute an optimal reward-maximizing policy w.r.t.\ $\tilde{\omega}$. Such a reward-maximizing objective aligns with existing RL techniques, but still requires that we address the challenge of learning in the combinatorial state and action spaces of the RMAB. To address this challenge, \emph{we propose a new RL method which decomposes the RMAB into $N$ per-arm learning problems and a complementary $\lambda$-network learning problem}, which together learn to spend limited budget where it will give the best return, as detailed in the next section.

Conversely, the nature oracle seeks to find a parameter setting $\omega$, or pure strategy, that maximizes the agent's regret given a mixed strategy $\tilde{\pi}$, i.e., maximize $L(\tilde{\pi},\omega)$ with respect to $\omega$, while $\tilde{\pi}$ is fixed. This objective is even more challenging because both $G(\pi^\star_{\omega},\omega)$ and $G(\tilde{\pi},\omega)$ are functions of $\omega$. Most critically, computing $G(\pi^\star_{\omega},\omega)$ requires obtaining an optimal policy $\pi^\star_{\omega}$ as $\omega$ changes in the optimization---this amounts to a planning problem in which an agent must learn an optimal policy while the environment changes, controlled by $\omega$, making the nature oracle difficult to solve. Moreover, in the interval uncertainty setting we consider, $\omega$ is defined by a space of continuous values; thus nature's pure strategy space is infinite, making the problem even more complex, since it cannot be exhaustively searched. 
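Substituting the definition of $\pi^\star_{\omega}$ into Eq.~\ref{eq:regret} makes this nested structure explicit. As a restatement for clarity (not an additional result), the nature oracle's objective is
\[
    \max_{\omega} L(\tilde{\pi},\omega) = \max_{\omega}\Big[\, \max_{\pi} G(\pi,\omega) - G(\tilde{\pi},\omega) \Big] \ ,
\]
where the inner maximization over $\pi$ must in principle be re-solved each time the outer variable $\omega$ changes.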

\emph{To tackle this complexity we propose a novel method for implementing the regret-maximizing nature oracle by casting it as a MARL problem}. The approach, visualized in Fig.~\ref{fig:concept}(b), trains one auxiliary agent to solve for a policy $\pi^\star_{\omega}$ ($\pi^A$ in Fig.~\ref{fig:concept}(b)), needed to compute $G(\pi^\star_{\omega},\omega)$ in the regret term, and simultaneously trains a second agent to learn worst-case parameters $\omega$ ($\pi^B$ in Fig.~\ref{fig:concept}(b)) that minimize $G(\tilde{\pi},\omega)$---together, these will maximize the regret  $L(\tilde{\pi},\omega)$. The non-stationarity is mitigated in this MARL setup by centralized critic networks which allow each agent to include the other's actions in their learned state space.
Ultimately, solving a MARL problem requires an RL algorithm to optimize the underlying policy; hence we first introduce our novel RL approach, DDLPO, to solve RMABs (Sec.~\ref{sec:rl-rmab}) as a part of our agent oracle and then use the algorithm as the backbone of our nature oracle (Sec.~\ref{sec:marl-rmab}).
%our multi-agent RL approach relies on the DDLPO algorithm described above as the underlying technique. 
%DDLPO will be the backbone of both the agent and nature oracles, so we describe it first, then provide our full double oracle algorithm.

%However, to solve a multi-agent reinforcement learning problem requires an algorithm for deep reinforcement learning the underlying problem. To the best of the authors' knowledge, no such algorithms exist for multi-action RMABs. Therefore, as our second main contribution we introduce RMAB Proximal Policy Optimization (DDLPO), a novel deep reinforcement learning algorithm for computing Lagrange policies for multi-action RMABs.

\subsection{Agent Oracle: Deep RL for RMAB}
\label{sec:rl-rmab}


\begin{algorithm}[t]
\caption{DDLPO}
\label{alg:ddlpo}
\begin{flushleft}
\textbf{Input}: Initial state $\bm{s}_0$, nature mixed strategy $\tilde{\omega}$ \\
\textbf{Parameters}: \texttt{n\_epochs}, \texttt{n\_subepochs}, \texttt{n\_steps}
\end{flushleft}
\begin{algorithmic}[1] %[1] enables line numbers
\STATE Randomly init.~policy net $\pi_{\theta_n}$ for each arm $n \in [N]$
\STATE Randomly init.~$\lambda$-network $\Lambda$
\STATE Initialize an empty \texttt{buffer}
\FOR{$\textit{epoch} = 1, 2, \ldots, \texttt{n\_epochs}$}
% \STATE Sample $\lambda$ from \textsc{Lambda} \\
\STATE Sample $\lambda = \Lambda(\bm{s})$
\FOR {$\textit{subepoch} = 1, \ldots, \texttt{n\_subepochs}$}
\FOR {timestep $t = 1, \ldots, \texttt{n\_steps}$}
\STATE Sample action $a_n = \pi_{\theta_n}(s_n, \lambda)$ \hspace{1mm} $\forall n \in [N]$
\STATE Add transition tuple $(\bm{s}, \bm{a}, r, \bm{s}^\prime, \lambda)$ to \texttt{buffer} % NOTE: here we do not impose a budget constraint
\ENDFOR
\STATE Update arm policy networks $\pi_{\theta_n}$ via PPO, using tuples in \texttt{buffer} %, i.e., learn $Q(s, a, \lambda)$ and $\pi(s, \lambda)$ for a given $\lambda$
% \STATE \textbf{else if} $j \bmod \kappa = 0$ \textbf{then} \hspace{.5em} Freeze $\altpolicy$ parameters \\
% \STATE \textbf{else} \hspace{.5em}  Freeze $\attract$ parameters \label{line:wake-end} \\
% \STATE Update $\altpolicy$ and $\attract$ using gradient ascent to maximize regret: $\reward(\altpolicy, \attract) - \reward(\defpolicy, \attract)$
\ENDFOR
\STATE Update $\Lambda$ using the sum of discounted costs from the final subepoch
\ENDFOR
% \STATE \textbf{return} $\attract$,  $\altpolicy$
\STATE \textbf{return} $\pi_{\theta_1}, \ldots, \pi_{\theta_N}$ and $\Lambda$
\end{algorithmic}
\end{algorithm}


% Existing index-based approaches for solving restless bandits are computationally infeasible for large problem sizes (many arms, many actions, or both) and cannot be extended to continuous states or actions. We present the first application of reinforcement learning to solve RMABs, which enables us to better scale while simultaneously accommodating continuous states and actions. Our RL implementation then becomes the foundation of our robust planning approach. 

Existing DRL approaches can be applied to the objective in Eq.~\ref{eq:combined_value_function}, but, as detailed in Section~\ref{sec:related_work}, they fail to scale past trivially sized RMAB problems since the action and state spaces grow exponentially in $N$.
% E.g., for a binary-action RMAB with $N=50$ and $B=20$, the action space would be of size $\binom{50}{20}\approx10^{12}$, which is not feasible to learn, even with a neural network. 
To overcome this, we develop a novel DRL algorithm that instead solves the decoupled problem (Eq.~\ref{eq:decoupled_value_func}). The key benefit of decoupling is to render policies and $Q$ values of each arm independent, allowing us to learn $N$ independent networks with \textit{linearly sized state and action spaces, relieving the combinatorial burden of the learning problem}.
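As a rough illustration with assumed sizes (a hypothetical instance, not one of our experimental settings), consider $N=50$ arms, each with two states and two actions. Before the budget constraint is applied, the joint problem has
\begin{equation*}
    2^{50} \approx 1.1 \times 10^{15}
\end{equation*}
joint states and equally many joint actions, whereas the decoupled problem requires $50$ networks that each handle only $2$ arm states, the scalar input $\lambda$, and $2$ actions.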
% --- the above example simplifies to a tractable $N \times 2 = 100$ actions. 
However, this decoupling approach introduces a new technical challenge: solving the dual objective, which maximizes over policies but minimizes over $\lambda$, as discussed in Section~\ref{sec:preliminaries}.

To solve this, we derive a dual gradient update procedure that iteratively optimizes each objective as follows: (1)~holding $\lambda$ constant, learn $N$ independent policy networks via policy gradient, augmenting the state space to include $\lambda$ as input, as in Eq.~\ref{eq:decoupled_value_func}; (2)~use sampled trajectories from those learned policies as an estimate to update $\lambda$ towards its minimizing value via a novel gradient update rule. Another challenge is that $\lambda^\star$ of Eq.~\ref{eq:decoupled_value_func} depends on the current state of each arm---therefore, a key element of our approach is to learn this function $\lambda^\star(\bm{s})$ concurrently with our iterative optimization, using a neural network we call the $\lambda$-network that is parameterized by $\Lambda$. To train the $\lambda$-network, we use the following gradient update rule.
\begin{restatable}[]{proposition}{lambdaUpdate}
\label{thm:lambda_update}
To learn the value $\lambda$ that minimizes Eq.~\ref{eq:decoupled_value_func} given a state $\bm{s}$, the $\lambda$-network, parameterized by $\Lambda$, should be updated with the following gradient rule:
%A gradient rule for updating the $\lambda$-network, parameterized by $\Lambda$, such that for a state $\bm{s}$, the $\lambda$-network predicts the value $\lambda$ that minimizes Eq.~\ref{eq:decoupled_value_func} is as follows:
\begin{equation}
\begin{aligned}
    \Lambda_t = \Lambda_{t-1} - \alpha \left( \frac{B}{1-\beta} + \sum_{n=1}^{N}D_n(s_n, \lambda_{t-1}(\bm{s})) \right) 
\end{aligned}
\end{equation}
where $\alpha$ is the learning rate and $D_n(s_n, \lambda)$ is the negative of the expected $\beta$-discounted sum of action costs for arm $n$ starting at state $s_n$ under the optimal policy for arm $n$ for a given value of $\lambda$.
\end{restatable}

% \begin{theorem}
% \label{thm:lambda-update}
%     \Lambda_t = \Lambda_{t-1} -  g(\bm{s}, \lambda_{t-1}(\bm{s})) = \Lambda_{t-1} - \alpha \left( \frac{B}{1-\beta} + \sum_{n=1}^{N}D_n(s_n, \lambda_{t-1}(\bm{s})) \right) 
% \end{theorem}
As $D_n$ lacks a closed form, the key insight we make is that it can be estimated by sampling multiple rollouts of the policy networks of all arms during training. As long as arm policies are trained for adequate time on the given value of $\lambda$, the gradient estimate will be accurate, i.e., $D_n(s_n, \lambda_{t-1}(\bm{s})) \approx -\sum_{k=0}^{K-1} \beta^k c^{k}_{n}$ where $K$ is the number of samples collected in an epoch and $c^{k}_{n}$ is the action cost of arm~$n$ in round~$k$. Moreover, this procedure will converge to the optimal parameters~$\Lambda^\star$ if the arm policies are optimal.
% \begin{equation}
% \begin{aligned}\label{eq:estimation_of_discounted_cost}
%     D_n(s_n, \lambda_{t-1}(\bm{s})) \approx -\sum_{z=0}^{T-1} \beta^z c_{nz}
% \end{aligned}
% \end{equation}

\begin{restatable}[]{proposition}{lambdaConvergence}
\label{thm:lambda_convergence}
Given arm policies corresponding to optimal $Q$-functions, 
% the gradient update rule of 
the update rule of Prop.~\ref{thm:lambda_update} leads $\Lambda$ to converge to the optimal parameters $\Lambda^\star$ as the number of training epochs and $K$ both tend to infinity.
\end{restatable}

Proofs are given in the appendix. One interesting feature of this update rule is that, to collect samples that reflect the proper gradient, the RMAB budget must not be imposed \textit{at training time}. Rather, the policy networks and $\lambda$-network must be allowed to learn to play the Lagrange policy of Eq.~\ref{eq:decoupled_value_func}, which learns to spend the correct budget in expectation via our iterative update procedure. Therefore, at training time, we sample actions randomly according to the actor network distributions, without imposing the budget constraint. However, \textit{at test time, we always take actions in a way that respects the budget constraint} by following the knapsack procedure described at the end of Section~\ref{sec:preliminaries}.
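
To make the sign convention of the update concrete, substituting the sampled estimate of $D_n$ given above into the rule of Prop.~\ref{thm:lambda_update} yields
\begin{equation*}
    \Lambda_t \approx \Lambda_{t-1} - \alpha \left( \frac{B}{1-\beta} - \sum_{n=1}^{N} \sum_{k=0}^{K-1} \beta^k c^{k}_{n} \right).
\end{equation*}
As a purely illustrative calculation with assumed values (not taken from our experiments), let $B = 2$ and $\beta = 0.9$, so that $B/(1-\beta) = 20$. If the sampled discounted cost summed over all arms is $25$, the term in parentheses is $20 - 25 = -5$; reading the update as acting on the network's prediction of $\lambda$ at $\bm{s}$, the prediction increases, which makes acting more expensive and discourages overspending. If instead the sampled discounted cost is $15$, the term is $20 - 15 = +5$ and the prediction decreases, which encourages more actions.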
% When $g$ is negative, it will indicate \textit{overspending}, meaning $\Lambda$ should increase it's prediction for $\lambda$ at state $\bm{s}$ to make acting more expensive, encouraging the policy to make more effective use of the budget across all arms. When $g$ is positive, it will indicate that the full budget is not being spent in expectation, meaning $\Lambda$ should decrease its prediction of $\lambda$ at $\bm{s}$ to encourage more actions. 

In theory, the policy networks could be trained via any DRL procedure that satisfies the above requirements for training the $\lambda$-network. In practice, we train with proximal policy optimization (PPO) \citep{schulman2017proximal}, a state-of-the-art policy gradient approach. Importantly, PPO is also flexible enough to handle both discrete and continuous actions, which is necessary for the nature oracle.

Finally, to enable our iterative dual-update procedure in practice, we need a mechanism to both (1)~explore new arm policy actions after an update to $\Lambda$ and (2)~exploit learned policy actions to develop good gradient estimates for $\Lambda$. We navigate this important trade-off by adding an entropy regularization term to the policy networks' losses, controlled via a cyclical temperature parameter. %Importantly, PPO applies to both discrete and continuous action networks, which is necessary to extend our algorithm to the nature oracle. 
We call our algorithm Deep Distributed Lagrange Policy Optimization (DDLPO), provide pseudocode in Algorithm~\ref{alg:ddlpo}, and include more implementation details in the appendix.
% As long as we pick learning rates and update timings appropriately, it should be easy enough to show that this procedure converges to the optimal solution, i.e., Eq.~\ref{eq:decoupled_value_func}. In theory, this update procedure also should plug and play nicely with the multi-agent RL procedure proposed for the nature oracle.

% our solution is to train $n$ separate RL agents 
%Why combined RL approach doesn't work: doesn't scale, esp in discrete settings. So we need the lambda approach. [describe lambda network]







\subsection{Nature Oracle: Multi-Agent RL}
\label{sec:marl-rmab}

Armed with a DRL procedure for learning RMAB policies, we now develop the MARL procedure to implement the nature oracle. Recall that the challenge of the nature oracle is to jointly optimize a policy $\pi^\star_{\omega}$ and model parameters~$\omega$. We propose to solve this optimization using MARL, which uses centralized critics to handle the non-stationarity induced when multiple agents influence one environment \citep{lowe2017multi}. The procedure is visualized in Fig.~\ref{fig:concept}(b).

To implement the nature oracle, we introduce two agents $A$ and $B$, where $A$'s goal is to optimize the RMAB policy $\pi^\star_{\omega}$ and $B$'s goal is to find parameters $\omega$ that maximize the regret of the current agent mixed strategy $\tilde{\pi}$. We define a shared transition function $T: \mathcal{S} \times \mathcal{A}_A \times \mathcal{A}_B \xrightarrow[]{} \mathcal{S}$. Here, $\mathcal{A}_A$ is the action space of the underlying multi-action RMAB. At a given state~$\bm{s}$, agent~$B$'s action space $\mathcal{A}_B$ consists of parameter settings~$\omega$, which, in general, depend on $\bm{s}$. That is, at each step, agent $B$ selects environment parameters $\omega$, and thus the transition probabilities that influence the outcome of agent $A$'s actions. We adapt the centralized critic idea from multi-agent PPO \citep{yu2021surprising} to our RMAB setting to create MA-DDLPO.
A notable strength of our MARL approach is that it allows the discrete-space policy of agent $A$ and the continuous-space policy of agent $B$ to be learned by separate networks, simplifying training compared to an alternative combined-network approach. Moreover, our choice to use PPO offers a convenient way to learn both types of policies as separate networks, while utilizing a single framework of update rules. % with minimal differences in implementation. 


% Formally, a multi-agent RL problem involves $Z$~agents, each with action space $\mathcal{A}_z$, a shared environment with states $\mathcal{S}$, and a shared environment transition function, $T:\mathcal{S} \times \mathcal{A}_1 \times \cdots \times \mathcal{A}_Z \xrightarrow[]{} \mathcal{S}$. Each agent can have arbitrary reward functions. A seminal paper introduced multi-agent deep deterministic policy gradient (MADDPG) \citep{lowe2017multi} for solving such tasks, where the core idea is to train centralized \textit{critics} which learn agent-specific Q-functions \textit{that have knowledge of all the other agents' actions}, and decentralized \textit{actors} which learn agent-specific policies. This setup is specifically designed to handle the non-stationarity induced when multiple agents influence one environment --- the centralized critic with knowledge of all agent actions allows each individual agent to learn as if the environment is stable by conditioning its value functions on the actions of the other agents, making learning itself more stable. We will build off of multi-agent PPO (MAPPO), a recent algorithm that uses the same centralized critic idea, but with a PPO training procedure \citep{yu2021surprising}.

% We use this formalism to implement the nature oracle as follows. There are two agents $A$ and $B$, where $A$'s goal is to optimize the RMAB policy $\pi^\star$  and $B$'s goal is to find parameters $\hat{\omega}$ that maximize regret of the current best agent strategy $\pi^\prime$. The transition function is $T: \mathcal{S} \times \mathcal{A}_A \times \mathcal{A}_B \xrightarrow[]{} \mathcal{S}$. The action space $\mathcal{A}_A$ will be the same as the action space of multi-action RMAB. At a given state $\bm{s}$, the action space $\mathcal{A}_B$ will allow agent $B$ to select $\hat{\omega}$ which, in general, may depend on $\bm{s}$. That is, at each step, agent $B$ will select environment parameters $\hat{\omega}$, and thus state/action transition probabilities that will determine the outcome of agent $A$'s actions. We adopt the centralized critic idea from MAPPO to our RMAB setting to create MA-DDLPO. Again, it is important that we use a PPO-based training procedure since agent $A$ has a discrete policy space but agent $B$ has a continuous policy space, and PPO offers a convenient way to train both policies with minimal differences in implementation. 

% Since DDLPO requires a cyclical procedure for updating the $\lambda$-network, the same will be true of DDLPO. 

% After training both agents simultaneously with MADDPG, each network will converge to a deterministic optimal policy -- most importantly, agent $B$'s policy will represent a deterministic setting of $\bm{\theta}$, i.e., a pure strategy for nature.

A critical step is then to define the rewards for agents $A$ and $B$ to match their objectives. Since agent $A$'s objective is to find $\pi^\star_{\omega}$, it adopts the reward defined by the underlying RMAB, i.e., ${R}^{(A)}(\bm{s}) = \sum_{n=1}^N R_n(s_n)$. However, agent $B$'s objective is to learn the regret-maximizing parameters $\omega$. This objective is challenging because it requires %maximizing regret of an input policy $\tilde{\pi}$, which relies on 
computing and optimizing over the returns of the fixed input policy $\tilde{\pi}$ with respect to all possible $\omega$, which is in general non-convex. In practice, to estimate the returns of $\tilde{\pi}_\omega$, we execute a series of roll-outs against agent $B$'s current action. %, but multi-step returns could be used where greater accuracy is required.
That is, given $\bm{s}$ at a given round, we sample an action from $\tilde{\pi}_{\omega}$ and the next state $\bm{s}^\prime$, and define the \textit{regret-based} reward of agent~$B$ as ${R}^{(B)} = \sum_{n=1}^N R_n(s_n) - \frac{1}{Y}\sum_{y=1}^{Y}r_y^{\tilde{\pi},\omega}$, where $r_y^{\tilde{\pi},\omega}$ is the reward from each of $Y$ one-step Monte Carlo simulations of the mixed strategy $\tilde{\pi}$ in $\omega$.
 %That is, for every $(\bm{s},\bm{a},r,\bm{s}^\prime)$ tuple sampled from the environment, we define regret here as $R_B = \sum_{n=1}^N R_n(s_n,a_n,s^\prime_n) - \frac{1}{Y}\sum_{y=1}^{Y}r_y^{\tilde{\pi}}$, where $r_y^{\tilde{\pi}}$ is the reward obtained from each of $Y$ random 1-step Monte Carlo simulations of mixed strategy $\tilde{\pi}$ from state $\bm{s}$.

To train the policies, agent $A$ has the same policy network architecture as DDLPO, i.e., $N$ discrete policy networks and one $\lambda$-network, and the agent $B$ actor network is a single continuous-action policy network. Since agents $A$ and $B$ have separate reward functions, they have their own critic networks, but these critics are \emph{centralized} in that each takes the actions of the other agent as input. Other than the centralized critic, agent $A$ is trained the same way as DDLPO, and agent $B$ is trained in a standard PPO fashion. In practice, to ensure good gradient estimates for agent $A$'s $\lambda$-network in MA-DDLPO, we keep agent $B$'s network, and thus the environment, constant between $\Lambda$ updates, updating $B$'s network with the same frequency as the $\lambda$-network. Pseudocode for MA-DDLPO and further details of its implementation are given in the appendix. %Algorithm~\ref{alg:marmabppo}.


\begin{algorithm}[t]
\caption{RR-DPO}
\label{alg:full-alg}
\textbf{Input}: Environment simulator and parameter uncertainty intervals $\overline{\underline{\omega}}_n$ for all $n \in [N]$ \\
\textbf{Parameters}: Convergence threshold $\varepsilon$%, number of perturbations $O$
\\
\textbf{Output}: Agent mixed strategy $\tilde{\pi}$
\begin{algorithmic}[1] %[1] enables line numbers
\STATE $\Omega_0 = \{\omega_0\}$, with $\omega_0$ selected at random
\STATE $\Pi_0 = \{\pi_{B_1}, \pi_{B_2}, \ldots\}$, where $\pi_{B_i}$ are baseline and heuristic strategies \\
\FOR{epoch $e = 1, 2, \ldots$}
\STATE Solve for $(\tilde{\pi}_e, \tilde{\omega}_e)$, mixed Nash equilibrium of regret game with strategy sets $\Omega_{e-1}$ and $\Pi_{e-1}$ \\
\STATE $\pi_e = \textsc{DDLPO}(\tilde{\omega}_e)$ \\
\STATE $\omega_e = \textsc{MA-DDLPO}(\tilde{\pi}_e)$ \\
\STATE $\Omega_e = \Omega_{e-1} \cup \{\omega_e\}, \Pi_e = \Pi_{e-1} \cup \{\pi_e\}$
\IF{$L(\tilde{\pi}_e, \omega_e) - L(\tilde{\pi}_{e-1}, \tilde{\omega}_{e-1}) \leq \varepsilon$ and $L(\pi_e, \tilde{\omega}_e) - L(\tilde{\pi}_{e-1}, \tilde{\omega}_{e-1}) \leq \varepsilon$}
\STATE \textbf{break}
\ENDIF
\ENDFOR
\STATE \textbf{return} $\tilde{\pi}_e$
% \STATE \textbf{return} $\tilde{\pi}_e$
\end{algorithmic}
\end{algorithm}



\subsection{Minimax Regret RMAB Double Oracle}
We now have all the pieces we need to run our robust algorithm, Robust RMABs via Deep Policy Oracles (RR-DPO), which is visualized in Fig.~\ref{fig:concept}(a), with pseudocode presented in Algorithm~\ref{alg:full-alg}, adapted from the MIRROR framework \citep{xu2021robust}.
%We cast the robust planning problem as a zero-sum game between an agent (who plans actions to take on the arms) and nature (who sets worst-case instantiations of the environment parameters within the uncertainty set). 
%The agent oracle, DDLPO, is given above in Section~\ref{sec:rl-rmab}. The nature oracle, MA-DDLPO, requires the greatest innovation, as we provide in Section~\ref{sec:marl-rmab}. 
% The double oracle approach proceeds as follows. 
%Each agent maintains a finite set of pure strategies, $\Pi$ for the agent and $\Omega$ for nature. Given each player's strategy set, we first compute a Nash equilibrium over the sets, giving an optimal mixed strategy for each player to play against its opponent. Next, each oracle is queried to produce a new pure strategy as a best response against the opponent's current mixed strategy. If each best response strategy is already in the players' strategy sets, then we terminate, having provably converged. Else we continue iterating.
% The agent maintains strategy set $\Pi$, initially empty, and nature maintains strategy set $\Omega$, initialized with an arbitrary parameter setting. In each iteration, we solve for a mixed Nash equilibrium in the regret game between the agent and nature to learn a mixed strategy ($\tilde{\pi}, \tilde{\omega}$) for each player. We then call the agent and nature oracles to compute best responses $\pi$ and $\omega$ to their opponent's strategy, which get added to their respective strategy sets $\Pi$ and $\Omega$. 
We use DDLPO to instantiate the agent oracle and MA-DDLPO to instantiate the nature oracle, and run RR-DPO until the improvement in value for each player is within a tolerance~$\varepsilon$ or until a set number of iterations is reached.

We now establish conditions under which RR-DPO converges to the minimax regret-optimal policy in finitely many iterations. For two-action RMABs, asymptotically, $\pi^{\textit{La}}_{\omega}\xrightarrow[]{}\pi^\star_\omega$. Thus, assuming that each oracle returns true best responses and that the pure strategy sets are finite, an analytical condition that is straightforward to achieve,\footnote{Straightforward to achieve for the nature oracle via discretization.} we obtain the following result.

\begin{restatable}[]{proposition}{rrdpoConvergenceProposition}
\label{thm:rrdpo_convergence}
RR-DPO converges in a finite number of steps to the minimax regret-optimal policy.
\end{restatable}

\noindent In addition, we empirically verify that good policies are found for the multi-action case, and that RR-DPO converges using our continuous-strategy-space nature oracle. Further, we show that a policy that maximizes reward assuming a fixed parameter set can incur arbitrarily large regret when the parameters are changed (proofs in the appendix).

\begin{restatable}[]{proposition}{regretProposition}
\label{thm:regret}
In the Robust RMAB problem with interval uncertainty, the max regret of a reward-maximizing policy can be arbitrarily large compared to a minimax regret-optimal policy.
% Any non-robust reward-maximizing approach can achieve arbitrarily bad performance when evaluated in terms of regret.
\end{restatable}


% \subsection{Agent oracle}

% \subsubsection{Description}


% \subsubsection{Theory}



% \subsection{Nature oracle}

% \subsubsection{Description}

% MARL approach to solving nature oracle

% Motivation for using a policy gradient--based RL algorithm for the agent oracle is that we can build upon it to enable the nature oracle to differentiate directly through to optimize the environment parameters.

% \subsection{Combined double oracle}




% \begin{theorem}
% \label{thm:convergence}
% finite convergence? or stronger guarantee of convergence?
% \end{theorem}

% \begin{proof}[Proof sketch]
% TODO Lily: proof sketch here. chexck if it's just all the same as UAI result
% \end{proof}

% \begin{theorem}
% \label{thm:pure-strategy}
% TODO Arpita:

% ultimately reducing to a pure strategy from the mixed strategy: 
% sampling a pure strategy from mixed strategy is lower complexity than solving a pure strategy (and pure strategy mixed Nash equilibrium may not even exist). additionally, if there are any pure strategies that have probability 0 in the mixed strategy, then we know an optimal pure strategy exists without considering those strategies with probability 0

% potentially make a hardness claim for pure strategies? why we can't work with pure strategies
% \end{theorem}



% The main challenge in implementing the approach will be in adjusting the MADDPG algorithm to interact with the RL approach that we develop for solving the RMAB problem. For instance, if we went even with the simple approach proposed in section \ref{section:approach_agent_oracle}, we would have to adjust MADDPG to handle the fact that we need to sample longer-term reward trajectories, rather than one-step rewards. \textbf{One key question -- how did Xu et al., define rewards using their DDPG approach to solve for $\pi^\star$ and $z$ simultaneously? How to define a one-step reward that involves estimating the outcome of $\pi^\prime$?}


% Firstly, computing $\pi^\star_{\bm{\theta}}$ is PSPACE-Hard in general, so even evaluating $L$ is difficult in practice. One possible approach is to consider the binary-action BRMAB case, and replace $\pi^\star_{\bm{\theta}}$ with $\pi^W_{\bm{\theta}}$ where $\pi^W_{\bm{\theta}}$ is the \textit{Whittle index} policy and is known to be asymptotically optimal under the technical condition \textit{indexability}. General algorithms that are polynomial in $N$ exist for computing the Whittle index policy to arbitrary precision. This would give us a tractable method for computing $L$. In the Multi-action setting, an alternative option would be to use the techniques from \citep{} to compute $\pi^{La}_{\bm{\theta}}$ where $La$ denotes the \textit{Lagrange policy} (see \citep{}) and is not known to have guarantees, but has good performance. Another option would be to use reinforcement learning (RL) to compute $\pi^\star_{\bm{\theta}}$ directly. However, doing this will be challenging because of the exponentially large state and action spaces over which the RL algorithm would have to learn. Another option would be to develop some novel RL techniques for computing $\pi^{La}_{\bm{\theta}}$, which has a more tractable state space (not clear if this would be more efficient than using existing techniques for solving for $\pi^{La}_{\bm{\theta}}$).


% Now, the planner must take decisions for all arms jointly, subject to two constraints each round: (1)~select one action for each arm and (2)~the sum of action costs over all arms must not exceed a given budget $B$. Formally, the planner must choose a decision matrix $\bm{A} \in \{0,1\}^{N\times M}$ such that:
% \begin{align}
%     &\sum_{j=1}^{M} \bm{A}_{ij} = 1 \hspace{3mm} \forall i \in [N] \label{eq:single_action_constraint} \\
%     &\sum_{i=1}^{N}\sum_{j=1}^{M} \bm{A}_{ij}c_j \le B \label{eq:budget_constraint}
% \end{align}
% Let $\overline{\mathcal{A}}$ be the set of decision matrices respecting constraints \ref{eq:single_action_constraint} and \ref{eq:budget_constraint} and let $\bm{s} = (s^1, ..., s^{N})$ represents the initial state of each arm. The planner's goal is to maximize the total discounted reward of all arms over time, subject to constraints \ref{eq:single_action_constraint} and \ref{eq:budget_constraint}, as given by the constrained Bellman equation:
% \begin{equation}
% \begin{aligned}\label{eq:combined_value_function}
%     J(\bm{s}) = \max_{\bm{A}\in \overline{\mathcal{A}}}\left\{\sum_{i=1}^{N} r^i(s^i) + \beta E[J(\bm{s}^\prime) | \bm{s}, \bm{A}]\right\}
% \end{aligned}
% \end{equation}
% However, this corresponds to an optimization problem with exponentially many states and combinatorially many actions, making it PSPACE-Hard to solve directly \citep{papadimitriou1999complexity}. To circumvent this, we take the Lagrangian relaxation of constraint \ref{eq:budget_constraint}

\section{Experimental Evaluation}
\label{sec:experiments}
We first experimentally demonstrate the importance of robust planning in the presence of uncertainty using a hand-crafted synthetic domain (inspired by Prop.~\ref{thm:regret}). We then evaluate our algorithm on two challenging real-world-inspired public health planning scenarios which demonstrate the capability of our robust RMAB framework. 

We compare RR-DPO against five baselines. These baselines include three variations of the reward-maximizing approach from \citet{hawkins2003langrangian}, which, given fixed model parameters $\omega$, computes a Lagrange policy at each step, then chooses actions following the knapsack procedure described in Section~\ref{sec:preliminaries}. The three variations are pessimistic (\textbf{HP}), mean (\textbf{HM}), and optimistic (\textbf{HO}), which assume the model parameters are set at the lower bound, mean, and upper bound of the given intervals for each arm, respectively. We also implement \textbf{RLvMid}, which \textit{learns} (rather than computes) a policy via DDLPO assuming \textit{mean} parameters, and \textbf{Rand}, which acts randomly to fill the budget. All results are averaged over 50 random seeds; experiments were run on a cluster running CentOS with an Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.1~GHz and 8~GB of RAM, using Python 3.7.10. Our DDLPO implementation builds on OpenAI Spinning Up \citep{SpinningUp2018} and RR-DPO builds on the MIRROR implementation \citep{xu2021robust}, computing Nash equilibria using Nashpy 0.0.21 \citep{Knight2018}. Code is available in the supplement and hyperparameter settings are in the appendix.
% We show that DDLPO provides an effective RL solution to solving RMABs, particularly as problem sizes increase, and that MA-DDLPO offers max regret--minimizing solutions with reasonable runtime for realistic problems. 

\subsection{Experimental Domains}

\textbf{Synthetic} demonstrates that reward-maximizing policies (RLvMid, HP, HM, HO)
%corresponding to either pessimistic (\textbf{HP}), mean (\textbf{HM}), or optimistic (\textbf{HO}) 
may incur large regret in the presence of uncertainty. There are three binary-action arm types $\{U,V,W\}$, each with $\mathcal{C} = \{0, 1\}$, $\mathcal{S}=\{0,1\}$, $R(s)=s$, and the following transition matrices, with rows and columns corresponding to actions and next states, respectively:
\[T^n_{s=0}=
\begin{bmatrix}
    0.5  &  0.5 \\
    0.5  &  0.5
\end{bmatrix}, \hspace{2mm}
T^n_{s=1}=
\begin{bmatrix}
    1.0  &  0.0 \\
    1-p_n  &  p_n
\end{bmatrix}
\]
\[
p_U \in [0.00, 1.00],\hspace{0.5mm}
    p_V \in [0.05, 0.90],\hspace{0.5mm}
    p_W \in [0.10, 0.95]
% \begin{matrix}
%     p_U \in [0.00, 1.00] \\
%     p_V \in [0.05, 0.90] \\
%     p_W \in [0.10, 0.95]
% \end{matrix} \ .
\]
When an arm is at $s=0$, both actions have the same effect on the state transition. When the arms are at $s=1$, selecting arms with high $p_n$ is optimal. This implies that policies can be specified by the order in which arms would be acted on when they are in state $s=1$. Accordingly, $\pi_\textit{HP} = [W,V,U]$, $\pi_\textit{HM} = [W,U,V]$, and $\pi_\textit{HO} = [U,W,V]$. However, observe that there exist values of $p_n$ that can make each of the reward-maximizing policies incur large regret. For example, for $\pi_\textit{HO}$, setting $p_U=0.0$, $p_V=0.9$, and $p_W=0.1$ induces the optimal policy $[V,W,U]$, which is the reverse of $\pi_\textit{HO}$.
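
These three orderings can be checked directly from the interval endpoints. The table below is a small worked check, assuming that the ``mean'' parameter of HM is the midpoint of each interval and that arms are ranked by decreasing $p_n$:
\begin{center}
\begin{tabular}{lcccc}
\toprule
Assumption & $p_U$ & $p_V$ & $p_W$ & Order \\
\midrule
Lower (HP) & $0.000$ & $0.050$ & $0.100$ & $[W,V,U]$ \\
Mean (HM)  & $0.500$ & $0.475$ & $0.525$ & $[W,U,V]$ \\
Upper (HO) & $1.000$ & $0.900$ & $0.950$ & $[U,W,V]$ \\
\bottomrule
\end{tabular}
\end{center}
No single ordering is best for every parameter setting in the intervals, which is what allows nature to choose values that penalize each fixed ordering.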

\textbf{ARMMAN} is a real-world \emph{maternal healthcare intervention problem} modeled as a binary-action RMAB \citep{biswas2021learn}. The goal is to select a subset of mothers each week to intervene on, in order to encourage engagement with tailored automated maternal health messaging. The behavior of enrolled women is modeled by an MDP with three states: Self-motivated, Persuadable, and Lost Cause. We use the summary statistics given in their paper and assume uncertainty intervals of width $0.5$ centered around the transition parameters, resulting in 6 uncertain parameters per arm (details in the appendix). Similar to the setup of \citet{biswas2021learn}, we assume a 1:1:3 split of arms with high, medium, and low probability of increasing their engagement upon intervention. In our experiments, we scale the value of $N$ in multiples of $5$ to preserve this 1:1:3 split of arm categories.


\begin{figure*}[t]
    \centering
    \includegraphics[width=0.95\textwidth]{img/all_experiments.pdf}
    \caption{\textbf{(a--f)} Maximum policy regret of RR-DPO in the robust setting for the Synthetic (a,b), ARMMAN (c,d), and SIS (e,f) domains. Lower is better. $N$ is scaled in multiples of 3 for Synthetic and 5 for ARMMAN to maintain the distributions of arm types specified in Section~\ref{sec:experiments}. (e)~uses $S=50$ and (f)~uses $N=5,B=4$. RR-DPO beats all baselines by a large margin across various parameter settings. \textbf{(g--l)}~Policy returns of DDLPO in the reward-maximizing setting (agent oracle) for the Synthetic (g,h), ARMMAN (i,j), and SIS (k,l) domains. Higher is better. (k)~uses $S=50$ and (l)~uses $N=5,B=4$. DDLPO is competitive across parameter settings.}
    \label{fig:all_experiments}
\end{figure*}

\begin{figure}[t]
    \centering
    \includegraphics[width=0.58\linewidth]{img/hawkins_slow_runtime_sis.pdf}
 \caption{Query time of the Hawkins baseline scales poorly compared to DDLPO, as discussed in Section~\ref{sec:experiments}, even for relatively small problem sizes ($N = 10, B = 2$).}
    \label{fig:hawkins_bad_runtime}
\end{figure}

\textbf{SIS Epidemic Model} is a discrete-state model in which arms represent distinct geographic regions and each member of an arm's population of size $N_{\textit{p}}$ is either (\textbf{S})usceptible to or (\textbf{I})nfected with an infectious disease. Such models have been the subject of increased interest following the COVID-19 pandemic \citep{hinch2021openabm,kerr2021covasim}, and this one serves as our large-state, multi-action experimental domain. In our model, the state of each arm is the count of \textbf{S} members of its population. Each arm's SIS model is defined by two parameters: $\kappa$, the average number of contacts per round, and $r_{\textit{infect}}$, the probability of infection given contact with an \textbf{I} member. The computation of discrete state transition probabilities from these parameters is derived from \citet{yaesoubi2011generalized} and given in the appendix. We introduce three intervention actions $\{a_0, a_1, a_2\}$ with costs $\mathcal{C}=\{0, 1, 2\}$. Action $a_0$ represents no action, $a_1$ represents messaging about physical distancing (divides $\kappa$ by $a^{\textit{eff}}_1$), and $a_2$ represents distributing face masks (divides $r_{\textit{infect}}$ by $a^{\textit{eff}}_2$). We impose the following uncertainty intervals: $\kappa \in [1, 10]$, $r_{\textit{infect}} \in [0.5, 0.99]$, and $a^{\textit{eff}}_{\{1,2\}} \in [1, 10]$.



% \noindent\textbf{Robust Double Oracle}
% \label{sec:experiments-do}





% \begin{figure}[t]
%     \centering
%     \begin{subfigure}[t]{0.325\linewidth}
%         \centering
%         \includegraphics[width=\textwidth]{example-image-a}
%         \caption{Counterexample domain, varying $n$}
%         \label{fig:counterexample-n}
%     \end{subfigure}
%     \hfill
%     \begin{subfigure}[t]{0.325\linewidth}
%         \centering
%         \includegraphics[width=\textwidth]{example-image-b}
%         \caption{ARMMAN domain, varying $n$}
%         \label{fig:armman-n}
%     \end{subfigure}
%     \hfill
%     \begin{subfigure}[t]{0.325\linewidth}
%         \centering
%         \includegraphics[width=\linewidth]{example-image-c}
%         \vfill
%         \caption{Domain C, varying $n$}
%         \label{fig:van-n}
%     \end{subfigure} \\
    
%     \begin{subfigure}[t]{0.325\linewidth}
%         \centering
%         \includegraphics[width=\textwidth]{example-image-a}
%         \caption{Counterexample domain, varying budget}
%         \label{fig:counterexample-budget}
%     \end{subfigure}
%     \hfill
%     \begin{subfigure}[t]{0.325\linewidth}
%         \centering
%         \includegraphics[width=\textwidth]{example-image-b}
%         \caption{ARMMAN domain, varying budget}
%         \label{fig:armman-budget}
%     \end{subfigure}
%     \hfill
%     \begin{subfigure}[t]{0.325\linewidth}
%         \centering
%         \includegraphics[width=\linewidth]{example-image-c}
%         \vfill
%         \caption{Domain C, varying budget}
%         \label{fig:van-budget}
%     \end{subfigure}
%     \caption{Performance of algorithms across all settings, evaluated in terms of regret}
%     \label{fig:performance}
% \end{figure}
\subsection{Performance of RR-DPO}
First, we evaluate the performance of the algorithms in uncertain environments. We compute the regret of an agent's pure strategy $\pi$ against a nature pure strategy $\omega$ as the difference between the average reward of the best strategy in the experiment against $\omega$ and the average reward obtained by $\pi$ against $\omega$. The average reward is the discounted sum of rewards over all arms for a horizon of length $10$, averaged over $25$ simulations. In each setting, DO runs for $6$ epochs, using $100$ rollout steps and $100$ training epochs for each oracle. After completion, each baseline strategy is evaluated by querying the nature oracle for the best response against that strategy, then computing the maximum regret over all $\omega$. The regret of RR-DPO is computed as the utility of the agent mixed strategy returned by DO over the two-player regret game.
% Hawkins is an online algorithm; converting it to offline would take exponential space in our problem setting
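
To make the evaluation concrete, the regret computation can be sketched as follows; the symbols $\bar{G}$, $\mathrm{Reg}$, and $\Pi_{\textit{exp}}$ are shorthand assumed for this illustration and may differ from the notation used elsewhere in the paper. Writing $\bar{G}(\pi, \omega)$ for the average reward of $\pi$ against $\omega$ and $\Pi_{\textit{exp}}$ for the set of agent strategies in the experiment,
\begin{equation*}
    \mathrm{Reg}(\pi, \omega) = \max_{\pi' \in \Pi_{\textit{exp}}} \bar{G}(\pi', \omega) - \bar{G}(\pi, \omega).
\end{equation*}
Under this shorthand, the max regret reported for a baseline strategy $\pi$ is $\max_{\omega} \mathrm{Reg}(\pi, \omega)$, where the maximum ranges over the nature strategies considered, including nature's best response to $\pi$.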

Fig.~\ref{fig:all_experiments}(a--f) shows that RR-DPO incurs the lowest regret, beating the baselines in all domains. Panels (a,b) show results on the synthetic domain, demonstrating that our approach can reduce regret by ${\sim}50\%$ relative to the baselines across various values of $N$ and $B$.
% This is expected because the domain is designed to ensure large regret for HP, HM, and HO baselines. 
% Here, a benefit of $50\%$ in the regret corresponds to, in the worst case, keeping $1/3$ of the arms in the good state for an extra round compared to the baseline. 
Moreover, as $B$ increases, the regret incurred may increase, since a higher budget implies better reward potential for the optimal policy; however, the regret of RR-DPO remains small even as $B$ grows.
Similarly, for the ARMMAN domain (c,d), a challenging domain adapted from a real-world problem, our algorithm performs consistently better than the baselines, achieving regret around $50\%$ lower than the best baseline. In the SIS domain (e,f), another real-world planning setting with a larger state space and multiple actions, our results are robust across parameter settings. Importantly, this holds even as we increase the state space from $S=100$ to $500$ (f), at which point running the Hawkins baseline becomes prohibitively expensive.

\emph{Finally, we run sensitivity analyses of the algorithms with respect to the horizon $H$ and the size of the uncertainty sets}, with results given in the appendix. When $H$ varies from $10$ to $100$, RR-DPO maintains very low regret while the competitors' regret as much as doubles, raising RR-DPO's relative improvement to as much as ${\sim}60\%$. Similar results are obtained when scaling the uncertainty intervals to $0.25$, $0.5$, and $1.0$ times their widths in the experiments of Fig.~\ref{fig:all_experiments}, with RR-DPO always dominating.

\subsection{Performance of DDLPO}
We also evaluate the performance of DDLPO, our novel DRL approach for finding reward-maximizing policies for multi-action RMABs, which implements our agent oracle. We compare against the \textbf{No Action} and \textbf{Random} baselines, as well as the computationally intensive solution by Hawkins, which computes the Lagrange policy but requires exact model parameters and discrete states and actions. Hawkins upper bounds DDLPO on small discrete problems, since it is exact whereas DDLPO learns the Lagrange policy from samples. Each experiment is a traditional reward-maximizing RMAB instantiated, for each seed, with a random sample of valid parameter settings.

Fig.~\ref{fig:all_experiments}(g--l) shows that DDLPO achieves reward comparable to the Hawkins algorithm and significantly better than random, which helps explain the success of the RR-DPO approach that DDLPO enables and shows promise for DDLPO as an algorithm of general interest. In the synthetic domain (g,h), DDLPO learns to act on the $33\%$ of arms that belong to category $W$. The mean reward of DDLPO almost matches that of the Hawkins algorithm as $N$ scales with a commensurate budget (g). As we fix $N$ and vary the budget (h), the optimal policy accumulates more reward, and DDLPO nearly matches it. We observe similar results on the ARMMAN domain (i,j), a challenging real-world health problem. On the SIS domain (k,l), the strong performance of DDLPO holds in a multi-action setting even as we increase the number of states from 50 to 500 (l). %The bottom-right plot shows that DDLPO achieves reasonable reward. %Additional results for larger settings are included in the appendix.

Moreover, DDLPO is computationally cheaper than Hawkins: as Fig.~\ref{fig:hawkins_bad_runtime} shows, a single rollout ($10$ rounds) of Hawkins takes ${\sim}100$ seconds when there are $500$ states, and its runtime scales quadratically in general. Running Hawkins in the loop of RR-DPO would therefore be prohibitive, since agent policies are evaluated thousands of times to compute the regret matrices. With just $25$ simulations, evaluating a single cell of the regret matrix, which has $|\Pi| \times |\Omega|$ cells in total, would take ${\sim}42$ minutes. %The key computational bottleneck with Hawkins is that it requires solving a linear program for the current state profile, that would subsequently change in the next timestep. Additional results for larger settings are included in the appendix. 
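
The arithmetic behind this estimate is as follows, using the approximate per-rollout time reported for $500$ states:
\begin{align*}
    25 \text{ simulations} \times 100\,\text{s} &= 2500\,\text{s}\\
    &\approx 41.7 \text{ minutes per cell.}
\end{align*}
Filling the full regret matrix with Hawkins would thus take roughly $2500 \times |\Pi| \times |\Omega|$ seconds. As a purely hypothetical example, a matrix with $|\Pi| = |\Omega| = 10$ would already require $250{,}000$ seconds, or about $69$ hours.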

% \paragraph{Limitations}
% \label{sec:limitations}
% We believe these advancements have the potential to improve resource allocation in low-resource settings, but acknowledge they are not without tradeoffs. For example, the baseline methods we compare against, while less robust, can provide interpretable `index' policies that capture the value for acting on an arm, whereas our solution's output can be difficult for a domain expert to interpret. Further, such optimization tools have the risk of amplifying underlying biases in the data and translating that to unfair resource allocation. However, by addressing the robust version of the problem, we directly address this concern by providing a flexible tool for \textit{mitigating} biases, by allowing users to tune their uncertainties and thus, providing natural ways to develop good policies even when data availability is skewed.
%Baselines: combined RL, Hawkins, random, no action. 
% some myopic strategy?

%evaluate against Hawkins here; we might not beat Hawkins w.r.t. reward but we should beat w.r.t. regret. perhaps all we need is to show that our computation speed is better than Hawkins even if performance is not better. (but Milind also said that we don't want to worry about the RL vs. MIP debate; we shouldn't have to worry about having to justify our decision to use RL)



\section{Conclusion}
% We make a key advancement in Restless Bandit modeling by introducing the robust setting and providing robust planning tools for the common real-world scenario where available data or expert knowledge is limited. To enable our approach, we develop a novel deep RL framework for learning RMAB policies, DDLPO, which demonstrates promising performance independent of our robust policy framework. While we believe these advancements have the potential to improve resource allocation in low-resource settings, they are not without tradeoffs --- for example the baseline methods we compare against, while less robust, can provide interpretable `index' policies that capture the value for acting on an arm, whereas our solution's output is not calibrated in such a way and would be difficult for a domain expert to interpret. Further, such optimization tools have the risk of amplifying underlying biases in the data and translating that to unfair resource allocation. However, by addressing the robust version of the problem, we directly address this concern by providing a flexible tool for \textit{mitigating} biases, by allowing users to tune their uncertainties and thus natural ways to develop good policies even when data availability is skewed. Ultimately, we hope these contributions bring us a step in the direction of deploying restless bandits in the real world for positive and robust impact.
We address a key limitation that keeps RMABs out of many real-world settings: arm dynamics are not known precisely. Planning safe, effective policies requires robust approaches that account for this uncertainty. We provide one in RR-DPO, enabled by DDLPO, a novel deep-RL algorithm for RMABs that is of general interest. We hope our contributions bring us closer to deploying RMABs for real-world impact.






\begin{contributions} % will be removed in pdf for initial submission,
                      % so you can already fill it to test with the
                      % ‘accepted’ class option
    Briefly list author contributions.
    This is a nice way of making clear who did what and to give proper credit.

    H.~Q.~Bovik conceived the idea and wrote the paper.
    Coauthor One created the code.
    Coauthor Two created the figures.
\end{contributions}

\begin{acknowledgements} % will be removed in pdf for initial submission,
                         % so you can already fill it to test with the
                         % ‘accepted’ class option
    Briefly acknowledge people and organizations here.

    \emph{All} acknowledgements go in this section.
\end{acknowledgements}

\bibliography{9_bibliography}

\appendix
\input{8_appendix}

\end{document}
