% \documentclass{uai2022} % for initial submission
\documentclass[accepted]{uai2022} 
%% In your camera-ready you should use the 'accepted' parameter. This shows the authors and how an accepted paper will look like. The footer is 'Acccepted for X'. In the final version, the proceedings chairs will add the page numbers for PMLR and the final footer will be 'Proceedings of X'.
%
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2022} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2022} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros

%% beginning of transcription of abaisero.sty

\usepackage{amsmath}
\usepackage{amssymb}

% math commands

\newcommand\naturalset{\mathbb{N}}
\newcommand\realset{\mathbb{R}}
\newcommand\kstar{^{*}}
\newcommand\kplus{^{+}}

\DeclareMathOperator*\softmax{softmax}
\DeclareMathOperator*\softmin{softmin}
\DeclareMathOperator\sign{sign}

% linalg commands

\DeclareMathOperator*\diag{diag}
\DeclareMathOperator*\rank{rank}
\DeclareMathOperator*\trace{tr}

\DeclareMathOperator*\colspace{col}
\DeclareMathOperator*\nullspace{ker}
\DeclareMathOperator*\spanspace{span}

\newcommand\T{^\top}
\newcommand\I{^{-1}}
\newcommand\PI{^{+}}
\newcommand\IT{^{-\top}}
\newcommand\PIT{^{+\top}}

% optim commands

\newcommand\opt{^{*}}
\DeclareMathOperator*\argmax{argmax}
\DeclareMathOperator*\argmin{argmin}

% stats commands

\DeclareMathOperator\Cov{\mathbb{C}}
\DeclareMathOperator\DKL{{D_\text{KL}}}
\DeclareMathOperator\Ent{\mathbb{H}}
\DeclareMathOperator\Exp{\mathbb{E}}
\DeclareMathOperator\Ind{\mathbb{I}}
\DeclareMathOperator\KL{KL}
\DeclareMathOperator\MI{\mathbb{I}}
\DeclareMathOperator\Var{\mathbb{V}}

% dists commands

\newcommand\Categorical{\operatorname{Categorical}}
\newcommand\Dirichlet{\operatorname{Dirichlet}}
\newcommand\Normal{\operatorname{Normal}}
\newcommand\Uniform{\operatorname{Uniform}}

% ml commands

\newcommand\data{{\mathcal{D}}}
\newcommand\loss{\mathcal{L}}
\DeclareMathOperator\nll{nll}
\DeclareMathOperator\mse{MSE}

% rl commands

\newcommand\aset{\mathcal{A}}
\newcommand\bset{\mathcal{B}}
\newcommand\hset{\mathcal{H}}
\newcommand\oset{\mathcal{O}}
\newcommand\rset{\mathcal{R}}
\newcommand\sset{\mathcal{S}}

\newcommand\dfn{\mathrm{D}}
\newcommand\gfn{\mathrm{G}}
\newcommand\ofn{\mathrm{O}}
\newcommand\rfn{\mathrm{R}}
\newcommand\tfn{\mathrm{T}}

\newcommand\nohistory{\varepsilon}

\newcommand\policy{\pi}

\newcommand\qpolicy{Q^\policy}
\newcommand\qmodel{\hat Q}

\newcommand\vpolicy{V^\policy}
\newcommand\vmodel{\hat V}

\newcommand\upolicy{U^\policy}
\newcommand\umodel{\hat U}

% misc options

\newcommand\iter[1]{^{(#1)}}

%% end of abaisero.sty

\usepackage{todonotes} % TODO remove eventually
\usepackage[inline]{enumitem}
\usepackage{amsthm}
\usepackage{cleveref}
\usepackage{subcaption}
\usepackage{algorithm}
\usepackage{algpseudocode}
\usepackage{xr}
\externaldocument{baisero_636-supp}

% note: amsthm must be loaded before cleveref, but the theorems must be defined after cleveref.
\theoremstyle{plain}
\newtheorem{theorem}{Theorem}[section]
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{corollary}[theorem]{Corollary}
\theoremstyle{definition}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{assumption}[theorem]{Assumption}
\theoremstyle{remark}
\newtheorem{remark}[theorem]{Remark}

% useful for pseudocode lines which are too long
\newcommand{\algparbox}[1]{\parbox[t]{\dimexpr\linewidth-\algorithmicindent}{#1\strut}}

\newcommand\qset{\mathcal{Q}}
\newcommand\uset{\mathcal{U}}
\newcommand\Ppolicy{P_{\policy}}
\newcommand\Bpolicy{B_{\policy}}
\newcommand\Bpolicyopt{B_{\policy\opt}}

% To make just enough space for some of the longer equation, I've renamed "stop" to "SG" for "stop-gradient" (I think I've seen this somewhere before)
\DeclareMathOperator{\Stop}{SG}
\newcommand\qloss{\loss_{\qmodel}}
\newcommand\uloss{\loss_{\umodel}}

% \renewcommand\paragraph[1]{\noindent\textbf{#1}\;}
\let\oldqmodel\qmodel
\renewcommand\qmodel{\smash{\oldqmodel}}
\let\oldumodel\umodel
\renewcommand\umodel{\smash{\oldumodel}}
\newcommand\pmodel{\hat\policy}

\newcommand\envlabel[1]{\textbf{#1}}
\newcommand\heavenhell{\envlabel{Heaven-Hell}}
\newcommand\heavenhellthree{\envlabel{Heaven-Hell-3}}
\newcommand\heavenhellfour{\envlabel{Heaven-Hell-4}}
\newcommand\carflag{\envlabel{Car-Flag}}
\newcommand\cleaner{\envlabel{Cleaner}}
\newcommand\gvmemoryfourrooms{\envlabel{GV-MemoryFourRooms-7x7}}

\newcommand\algolabel[1]{\textbf{#1}}
\newcommand\dqn{\algolabel{DQN}}
\newcommand\adqn{\algolabel{ADQN}}
\newcommand\adqnvr{\algolabel{ADQN-VR}}
\newcommand\adqnstate{\algolabel{ADQN-State}}
\newcommand\adqnstatevr{\algolabel{ADQN-State-VR}}

\title{Asymmetric DQN for Partially Observable Reinforcement Learning}


% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Important:  case of equal contributions, we strongly recommend to NOT show it in this part of the paper, but rather describe it in the appropriate section at the end of the paper "Author Contribution", where you have more space to describe how each author contributed.
%
% Add authors
% Remember to use the order convention "First/Given name" "Last/Family name", e.g. John Smith, Hanako Yamada, Marco Rossi, Wei Zhang
\author[1]{\href{mailto:<baisero.a@northeastern.edu>?Subject=Your UAI 2022 paper}{Andrea~Baisero}{}}
\author[1]{Brett~Daley}
\author[1]{Christopher~Amato}
% Add affiliations after the authors
\affil[1]{%
    Khoury College of Computer Sciences\\
    Northeastern University\\
    Boston, Massachusetts, USA
}
  
\begin{document}
\maketitle

\begin{abstract}
	Offline training in simulated partially observable environments allows reinforcement learning methods to exploit privileged state information through a mechanism known as asymmetry.
	Such privileged information has the potential to greatly improve the optimal convergence properties, if used appropriately.
	However, current research in asymmetric reinforcement learning is often heuristic in nature, with few connections to underlying theory or theoretical guarantees, and is primarily tested through empirical evaluation.
	In this work, we develop the theory of \emph{Asymmetric Policy Iteration}, an exact model-based dynamic programming solution method, and then apply relaxations which eventually result in \emph{Asymmetric DQN}, a model-free deep reinforcement learning algorithm.
	Our theoretical findings are complemented and validated by empirical experimentation performed in environments which exhibit significant amounts of partial observability, and require both information gathering strategies and memorization.
\end{abstract}

\section{Introduction}

Offline training and online execution (OTOE) is a modern reinforcement learning (RL) paradigm in which a learning agent is trained \emph{offline} (i.e., in simulation) before becoming operational \emph{online} (i.e., in the ``real'' environment).  Advantages of OTOE are broad and include safety guarantees, training speed, flexibility, and---the focus of our work---access to privileged information.  For all these reasons, OTOE has become the preferred paradigm in some research communities, such as multi-agent RL, where it is often called centralized training and decentralized execution (CTDE).
%
Privileged information is data which is accessible during offline training, but not during standard online training and/or execution.  This can take different forms depending on the type of control problem, e.g., other agents' actions and observations in multi-agent RL, or the system's state in partially observable RL (PORL).  In OTOE, access to this information is a temporary privilege, available exclusively in the offline phase thanks to access to the simulation's internal state.  However, despite not being available during online execution, such information has the potential (when used appropriately) to improve the agent's overall training performance and/or convergence speed, and therefore its online performance.

In PORL, OTOE and privileged information are most commonly associated with actor-critic methods through a mechanism called \emph{asymmetry}.  Asymmetry has a very specific etymological meaning described below;  however, we use the term ``asymmetry'' more loosely to also refer to the general idea of exploiting privileged information during offline training---two concepts which overlap strongly in this work.
%
In actor-critic methods, two separate models are being trained:  a \emph{policy} model (representing the agent's behavior) and a \emph{critic} model (representing the agent's evaluation of its situation).  Standard actor-critic methods can be said to be implicitly \emph{symmetric} in the sense that both models receive the same information---in PORL, the agent's history.  In \emph{asymmetric} actor-critic, this symmetry is broken by providing privileged information to the critic~\citep{pinto_asymmetric_2018,foerster_counterfactual_2018,lowe_multi-agent_2017,yang_cm3_2018,li_robust_2019,wang_r-maddpg_2020,warrington_robust_2021,xiao_local_2021,baisero_unbiased_2022,lyu_deeper_2022}.  This is possible because the critic is exclusively a training construct which is not used or needed during the execution phase.
%
Asymmetry has also been used in some DQN-like RL methods~\citep{rashid_qmix_2018,mahajan_maven_2019,rashid_weighted_2020,xiao_learning_2020,de_witt_deep_2020}, where normally there would not be a secondary model analogous to the critic which is only used during training.  In such cases, a second value-based model is introduced exclusively as a means for asymmetry;  it constitutes a training construct analogous to the critic in actor-critic.

However, a substantial majority of prior work in asymmetric RL has proposed heuristic forms of asymmetry which are primarily verified through empirical evaluations, and which lack the support of a theoretical framework guaranteeing that the state information is used appropriately.  We argue that, if unverified by proper theoretical analysis, such methods could use state information in ways which actually hinder the training of a partially observable agent.  For example, a well-known result in partially observable control is that the optimal action for a partially observable agent can differ greatly from that of an optimal fully observable agent.  An optimal partially observable agent might even take actions which an optimal fully observable agent would never take under any circumstance, e.g., information-gathering actions that help the partially observable agent learn something about the environment state, but do not help the fully observable agent.

The ultimate goal of this work is to develop a state-of-the-art asymmetric value-based deep RL algorithm for partially observable control that is supported by a sound theoretical analysis.  To reach this goal, we employ a bottom-up approach, focusing first on developing the theory of \emph{asymmetric policy improvement}, i.e., mechanisms through which privileged state information can be integrated into a policy improvement process while retaining optimal convergence guarantees.  In practice, we begin by developing \emph{Asymmetric Policy Iteration} (API) and \emph{Asymmetric Action-Value Iteration} (AAVI), model-based dynamic programming solution methods.  We then introduce elements of stochastic training from sample experience which result in \emph{Asymmetric Q-Learning} (AQL), a direct RL successor to API and AAVI.  Finally, we introduce value-function approximation which results in \emph{Asymmetric DQN} (ADQN), a method comparable to other state-of-the-art deep RL algorithms, but which is also capable of exploiting state information in a principled fashion.  To the best of our knowledge, our work is the first to develop theoretically-driven asymmetric value-based RL.

\section{Related Work}

Privileged information available offline has been used to improve training performance in a wide range of prior single-agent and multi-agent methods, which include both policy-based and value-based methods.
%policy-based~\citep{pinto_asymmetric_2018,foerster_counterfactual_2018,lowe_multi-agent_2017,yang_cm3_2018,li_robust_2019,wang_r-maddpg_2020,de_witt_deep_2020,warrington_robust_2021,xiao_local_2021,baisero_unbiased_2022,lyu_deeper_2022} and value-based methods~\citep{rashid_qmix_2018,mahajan_maven_2019,rashid_weighted_2020,de_witt_deep_2020}.

In single-agent control,
%
\cite{pinto_asymmetric_2018} employ DDPG with an asymmetric state-based critic to handle robot manipulation tasks;
%
belief-grounded networks~\citep{nguyen_belief-grounded_2021} use a belief-based form of asymmetry and a belief-reconstruction task to train the history representation;
%
\cite{warrington_robust_2021,chen_learning_2020} use imitation learning to train a partially observable agent via a fully observable agent trained offline.
%
\cite{baisero_unbiased_2022} show theoretical issues with state-only forms of asymmetry for policy-gradients, and develop a history-state variant.

In multi-agent control,
%
COMA~\citep{foerster_counterfactual_2018} uses a single centralized asymmetric critic which employs the joint observations and/or the environment state.
%
MADDPG~\citep{lowe_multi-agent_2017} and M3DDPG~\citep{li_robust_2019} use multiple centralized asymmetric critics, one for each agent, which employ the joint observations and/or the environment state.
%
R-MADDPG~\citep{wang_r-maddpg_2020} uses a recurrent model and a centralized critic which uses the entire histories of all agents;
%
CM3~\citep{yang_cm3_2018} uses a state-only critic for reactive control;
%
MacDec-DDRQN~\citep{xiao_learning_2020} uses a centralized value model to learn individual decentralized value models.
%
ROLA~\citep{xiao_local_2021} uses both centralized and individual asymmetric critics which employ local history and/or state information to estimate individual advantage values.
%
QMIX~\citep{rashid_qmix_2018}, MAVEN~\citep{mahajan_maven_2019}, and WQMIX~\citep{rashid_weighted_2020} use a centralized but factored value model to train individual agent value models.
%
\cite{lyu_deeper_2022} extend the theory by \cite{baisero_unbiased_2022} to the multi-agent case.

%\todo[inline]{check out Sunehag, Peter, Lever, Guy, Gruslys, Audrunas, Czarnecki, Wojciech Marian, Zambaldi, Vinicius, Jaderberg, Max, Lanctot, Marc, Sonnerat, Nicolas, Leibo, Joel Z., Tuyls, Karl, and Graepel, Thore. Value-Decomposition Networks For Cooperative Multi-Agent Learning Based On Team Reward. In Proceedings of the 17th International Conference on Autonomous Agents and Multiagent Systems, 2017.}
% 
% Unfortunately, a majority of this prior work is purely empirical in nature, is accompanied with little theory to justify methods which are instead validated exclusively through empirical evaluation in selected environments.  As an exception to this trend, the work by \cite{baisero_unbiased_2022} and \cite{lyu_deeper_2022} has focused on developing variants of asymmetric actor-critic grounded in sound theory in both single-agent and multi-agent control.  In this work, we will bridge the gap with value-based RL and develop the theory of asymmetric value-based methods.

\section{Background}

In this section, we present some of the background required to understand our work.
%
\Cref{sec:pomdp} formally describes partially observable control problems.
%
\Cref{sec:dqn} contains a review of non-asymmetric value-based control, in the form of DQN.
%
\Cref{sec:u} covers the definition of \emph{history-state} value functions.
%
In \Cref{sec:operators}, we present operator notation and useful operators.

\subsection{POMDPs}\label{sec:pomdp}

A partially observable Markov decision process (POMDP) is a discrete-time control problem represented by tuple $\langle \sset, \aset, \oset, b_0, T, O, R, \gamma \rangle$, where
%
\begin{enumerate*}[label=(\alph*)]
	%
	\item $\sset$, $\aset$, and $\oset$ are state, action, and observation spaces,
	      %
	\item $b_0\in\Delta\sset$ is an initial state distribution,
	      %
	\item $T\colon\sset\times\aset\to\Delta\sset$ is a stochastic state transition function,
	      %
	\item $O\colon\sset\times\aset\times\sset\to\Delta\oset$ is a stochastic observation emission function,
	      %
	\item $R\colon\sset\times\aset\to\realset$ is a reward function, and
	      %
	\item $\gamma\in[0, 1)$ is a discount factor.
	      %
\end{enumerate*}

Partially observable control is based on observable \emph{histories}, i.e., the sequences of past actions and observations.  The \emph{history} space $\hset\doteq (\aset\times\oset)^*$ represents such sequences.
%
To simplify notation, we overload symbol $R$ to also denote the expected reward function on histories $R(h, a) \doteq \Exp_{s\mid h}\left[ R(s, a) \right]$.
%
General partially observable policies take the form of mappings from \emph{histories} to action distributions $\policy\colon\hset\to\Delta\aset$;  however, in this work, we will focus exclusively on \emph{deterministic} policies $\policy\colon\hset\to\aset$.  The goal of the control problem is to find a policy which maximizes the episodic expected \emph{return} $\Exp\left[ \sum_t \gamma^t R(s_t, a_t) \right]$.

Every policy $\policy$ is associated with an action-value function $\qpolicy(h,a)$ which represents the expected return associated with the agent having observed history $h$, taking action $a$, and then continuing to behave according to the policy $\policy$.
%
$\qpolicy$ is the unique solution to the Bellman equation,
%
\begin{equation}
	%
	\qpolicy(h, a) = R(h, a) + \gamma\Exp_{o\mid h,a}\left[ \qpolicy(hao, \policy(hao)) \right] \,. \label{eq:q:bellman}
	%
\end{equation}

The action-value function associated with the optimal policy $\policy\opt$ is denoted as $Q\opt$, and is the unique solution to the Bellman \emph{optimality} equation,
%
\begin{equation}
	%
	Q\opt(h, a) = R(h, a) + \gamma\Exp_{o\mid h,a}\left[ \max_{a'} Q\opt(hao, a') \right] \,. \label{eq:q:bellman:opt}
	%
\end{equation}

\paragraph{Notation}
%
We use symbols $Q, \qpolicy, Q\opt$, and $\qmodel$ to denote similar but separate concepts.
%
$Q\colon \hset\times\aset\to\realset$ denotes an arbitrary real-valued function, not necessarily associated with any policy, $\qpolicy$ denotes the value function associated with a policy, $Q\opt$ denotes the value function associated with an optimal policy, while $\qmodel$ denotes a (deep) parametric model.  We denote the space of all such history-action real functions as  $\qset \doteq \{ Q \mid Q\colon \hset\times\aset\to\realset \}$.
%
Further, we use $g(Q)$ to denote the policy which acts greedily based on $Q$, i.e., if $\policy = g(Q)$, then $\policy(h) = \argmax_a Q(h, a)$.

\subsection{DQN}\label{sec:dqn}

Deep Q-Network (DQN)~\citep{mnih_human-level_2015} is a highly successful algorithm for training deep neural networks to control high-dimensional fully-observable Markov decision processes (MDPs) based on reward feedback, and the first to achieve human-level performance on a majority of the Atari 2600 games.
%
Rather than relying on a lookup table to track the estimated expected return for each state-action pair $(s,a)$, DQN learns a parametric function $\qmodel \colon \sset \times \aset \to \realset$ to generalize over state-action pairs.
The algorithm reformulates the incremental Q-Learning update~\citep{watkins_learning_1989} as a squared-error minimization problem,
%
\begin{equation}
	%
	\loss(s, a, r, s') \doteq \left( r + \gamma \max_{a' \in \aset} \qmodel(s',a';\theta^-) - \qmodel(s,a;\theta) \right)^2 \,,
	%
\end{equation}
%
\noindent where $\theta$ is a set of parameters and $\theta^-$ is a time-delayed copy of $\theta$ to stabilize learning.
The agent interacts with the environment and stores observed transitions $(s,a,r,s')$ in a replay memory, periodically updating $\theta$ via gradient descent on randomly sampled minibatches of experience~\citep{lin_self-improving_1992}.
This approach approximates the i.i.d.\ supervised training setting commonly used for neural networks and required for first-order optimization methods.
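
As a hedged illustration of the role of $\theta^-$ (our own worked step, not a formula quoted from the cited work), treating the bootstrapped target as a constant with respect to $\theta$ gives the gradient used in each descent step,
%
\begin{equation*}
	%
	\nabla_\theta \loss(s, a, r, s') = -2 \left( r + \gamma \max_{a' \in \aset} \qmodel(s',a';\theta^-) - \qmodel(s,a;\theta) \right) \nabla_\theta \qmodel(s,a;\theta) \,,
	%
\end{equation*}
%
\noindent so that only the prediction $\qmodel(s,a;\theta)$ is differentiated, while the target changes only when $\theta^-$ is refreshed.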

\paragraph{Adapting DQN to Partially Observable Control}
The DQN algorithm was primarily designed for fully observable control problems represented as MDPs.
Nonetheless, as with many other model-free RL algorithms, generalization to partially observable control is conceptually straightforward, and achievable by replacing state variables with history variables in the relevant equations, and by employing architectures capable of processing history data.
\emph{Frame stacking}, i.e., the practice of concatenating a small number of recent observations, has been found to be sufficient to tackle problems which feature small amounts of partial observability~\citep{mnih_human-level_2015}.
On the other hand, larger amounts of partial observability generally require longer-term memorization capabilities.
For such problems, the standard choice has become that of combining the DQN training algorithm with a recurrent neural network component used to process history data, also known as \emph{Deep Recurrent Q-Network} (DRQN)~\citep{hausknecht_deep_2015}.
Although some practitioners use the DQN label exclusively to indicate the variant which lacks a recurrent component, our view is that the essence of the DQN algorithm is in its training regime and its losses, rather than the details of which architecture is used.
Therefore, in this document, we use the label DQN more broadly to encompass all architectural variants.
In practice, because our work focuses on control problems which feature large amounts of partial observability, we employ appropriate algorithmic and modeling choices, i.e., all methods and baselines employ a history-based model $\qmodel$, and all models which receive history data employ a recurrent network component to process it.
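
Concretely, the substitution of states with histories described above can be sketched as follows.  Writing a sampled transition as $(h, a, r, o)$, with next history $hao$, the loss would take the form below;  this is our transcription of that substitution, under the assumption that $\qmodel\colon\hset\times\aset\to\realset$, rather than an equation taken from the cited works,
%
\begin{equation*}
	%
	\loss(h, a, r, o) \doteq \left( r + \gamma \max_{a' \in \aset} \qmodel(hao,a';\theta^-) - \qmodel(h,a;\theta) \right)^2 \,.
	%
\end{equation*}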

\subsection{History-State Value Functions}\label{sec:u}

Recent theoretical work in asymmetric actor-critic for PORL has employed the notion of a history-state value function $\upolicy(h, s, a)$~\citep{baisero_unbiased_2022,lyu_deeper_2022}, which represents the expected return associated with the agent having observed history $h$, the environment being in state $s$, taking action $a$, and then continuing to behave according to the policy $\policy$.  $\upolicy$ is the unique solution to the \emph{history-state} Bellman equation,
%
\begin{equation}
	%
	\upolicy(h, s, a) = R(s, a) + \gamma\Exp_{s',o\mid s,a} \left[ \upolicy(hao, s', \pi(hao)) \right] \,. \label{eq:u:bellman}
	%
\end{equation}

Despite using the state context to represent a more informed measure of the agent's expected return, $\upolicy$ still relates to a partially observable agent which is unable to exploit that privileged information, i.e., the state determines future rewards, observations, and states, but it does not directly determine future actions, which are rather determined indirectly by the history.
%
$\upolicy$ is related to $\qpolicy$ via a simple identity,
%
\begin{equation}
	%
	\qpolicy(h, a) = \Exp_{s\mid h}\left[ \upolicy(h, s, a) \right] \,. \label{eq:q}
	%
\end{equation}
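
As a brief sketch of why \Cref{eq:q} is compatible with the two Bellman equations, one can take the expectation of \Cref{eq:u:bellman} over $s\mid h$.  The steps below are our own worked derivation;  they rely on the Markov property of the state, on the fact that the action depends on the state only through the history, and on the chain rule $\Pr(s', o\mid h, a) = \Pr(o\mid h, a) \Pr(s'\mid hao)$.  Let $\tilde Q(h, a) \doteq \Exp_{s\mid h}\left[ \upolicy(h, s, a) \right]$ be a temporary symbol introduced only for this sketch;  then
%
\begin{align*}
	%
	\tilde Q(h, a) & = \Exp_{s\mid h}\left[ R(s, a) \right] + \gamma \Exp_{s\mid h} \Exp_{s',o\mid s,a}\left[ \upolicy(hao, s', \policy(hao)) \right] \\
	%
	               & = R(h, a) + \gamma \Exp_{o\mid h,a} \Exp_{s'\mid hao}\left[ \upolicy(hao, s', \policy(hao)) \right]                           \\
	%
	               & = R(h, a) + \gamma \Exp_{o\mid h,a}\left[ \tilde Q(hao, \policy(hao)) \right] \,.
	%
\end{align*}
%
\noindent Hence $\tilde Q$ satisfies \Cref{eq:q:bellman}, and since $\qpolicy$ is the unique solution to that equation, $\tilde Q = \qpolicy$.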

We denote the history-state value function associated with the optimal policy as $U\opt$.  Once again, this notion of optimality is relative to the space of partially observable policies.  Among other things, this means that an optimal partially observable policy cannot be recovered by maximizing $U\opt$, i.e., generally, there is no guarantee that $\policy\opt(h) = \argmax_a U\opt(h, s, a)$ for any given value of $s$.

\paragraph{Notation}
%
We use symbols $U, \upolicy, U\opt$, and $\umodel$ to denote similar but separate concepts.
%
$U\colon \hset\times\sset\times\aset\to\realset$ denotes an arbitrary real-valued function, not necessarily associated with any policy, $\upolicy$ denotes the value function associated with a policy, $U\opt$ denotes the value function associated with an optimal policy, while $\umodel$ denotes a (deep) parametric model.  We denote the space of all such history-state-action real functions as  $\uset \doteq \{ U \mid U\colon \hset\times\sset\times\aset\to\realset \}$.

\subsection{Operator Notation}\label{sec:operators}

To simplify the upcoming math, we make extensive use of operator notation for mappings between $Q$ and $U$ functions.

Operator $\Bpolicy\colon\qset\to\qset$ is the Bellman operator defined as $\Bpolicy Q(h, a) \doteq R(h, a) + \gamma \Exp_{o\mid h, a}\left[ Q(hao, \policy(hao)) \right]$, with which \Cref{eq:q:bellman} can be rewritten as $\qpolicy = \Bpolicy \qpolicy$.
%
\begin{lemma}\label{thm:bpolicy:q}
	%
	Operator $\Bpolicy$ is a contraction with fixed point $\qpolicy$ (proof in \Cref{sec:proof:thm:bpolicy:q}).
	%
\end{lemma}
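
The intuition behind this lemma can be sketched in one line.  We assume here the supremum norm $\lVert Q \rVert_\infty \doteq \sup_{h,a} \lvert Q(h, a) \rvert$ and bounded functions $Q, Q'\in\qset$;  both are assumptions made for this sketch, and the referenced proof remains the authoritative argument.  For any $h$ and $a$,
%
\begin{align*}
	%
	\left\lvert \Bpolicy Q(h, a) - \Bpolicy Q'(h, a) \right\rvert & = \gamma \left\lvert \Exp_{o\mid h,a}\left[ Q(hao, \policy(hao)) - Q'(hao, \policy(hao)) \right] \right\rvert \\
	%
	                                                              & \le \gamma \lVert Q - Q' \rVert_\infty \,,
	%
\end{align*}
%
\noindent so that $\lVert \Bpolicy Q - \Bpolicy Q' \rVert_\infty \le \gamma \lVert Q - Q' \rVert_\infty$ with $\gamma < 1$.  Under the same assumptions, repeated application yields the geometric rate $\lVert \Bpolicy^n Q - \qpolicy \rVert_\infty \le \gamma^n \lVert Q - \qpolicy \rVert_\infty$.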

Operator $B\colon\qset\to\qset$ is the Bellman \emph{optimality} operator defined as $BQ(h, a) \doteq R(h, a) + \gamma \Exp_{o\mid h, a}\left[ \max_{a'} Q(hao, a') \right]$ or, equivalently, $B\colon Q\mapsto B_{g(Q)} Q$, and with which \Cref{eq:q:bellman:opt} can be rewritten as $Q\opt = BQ\opt$.
%
\begin{lemma}\label{thm:b:q}
	%
	Operator $B$ is a contraction with fixed point $Q\opt$ (proof in \Cref{sec:proof:thm:b:q}).
	%
\end{lemma}

Some operators for $U$ are analogous to those for $Q$.  To avoid introducing a separate set of symbols for such cases, we overload the previously defined symbols to include these new meanings;  the distinction will remain clear from context, usually as the type of the operator's input/output.

Operator $\Bpolicy\colon\uset\to\uset$ is the Bellman operator defined as $\Bpolicy U(h, s, a) \doteq R(s, a) + \gamma \Exp_{s', o\mid s, a}\left[ U(hao, s', \policy(hao)) \right]$, with which \Cref{eq:u:bellman} can be rewritten as $\upolicy = \Bpolicy \upolicy$.
%
\begin{lemma}\label{thm:bpolicy:u}
	%
	Operator $\Bpolicy$ is a contraction with fixed point $\upolicy$ (proof in \Cref{sec:proof:thm:bpolicy:u}).
	%
\end{lemma}

Operator $E\colon\uset\to\qset$ converts $U$ functions to $Q$ functions by taking the conditional expectation over states, and is defined as $EU(h, a) \doteq \Exp_{s\mid h}\left[ U(h, s, a) \right]$.

\begin{definition}[Mutual Consistency]
	%
	We say that functions $Q$ and $U$ are \emph{mutually consistent} iff $Q = EU$ holds.
	%
\end{definition}

\section{Asymmetric Value-Based PORL}

In this section, we present the core of our theoretical and algorithmic contributions, which focus on computing or learning optimal action-values $Q\opt(h, a)$ by means of asymmetry.
%
In \Cref{sec:api} we present \emph{Asymmetric Policy Iteration} (API), a solution method for tabular models with optimal convergence guarantees.
%
In \Cref{sec:aavi} we present \emph{Asymmetric Action-Value Iteration} (AAVI), an eager variant of API with similar optimal convergence guarantees.
%
In \Cref{sec:aql} we relax aspects of AAVI to make it suitable for learning by means of sample experience, and present \emph{Asymmetric Q-Learning} (AQL).
%
In \Cref{sec:adqn} we introduce value function approximation to improve generalization, and present \emph{Asymmetric DQN} (ADQN), and other related variants.

\paragraph{Introducing Asymmetry to Value-Based Methods}
%
Two fundamental issues prevent the direct use of state information in value-based methods:
%
\begin{enumerate*}[label=(\alph*)]
	%
	\item because an action-value model $\qmodel(h, a)$ is eventually used for online control, it is constrained by the control problem and cannot directly employ privileged state information; and
	      %
	\item typical value-based methods do not feature a separate model for the purpose of offline training which may access privileged information (akin to the critic in actor-critic).
	      %
\end{enumerate*}
%
As such, value-based methods seem fundamentally incompatible with the notion of asymmetry and the use of privileged information.
%
We resolve both issues by introducing an auxiliary history-state model $\umodel(h, s, a)$, trained to model the optimal history-state value function $U\opt(h, s, a)$, and used exclusively as a training construct through which to implement asymmetry.  Our goal is to train $\umodel$ and $\qmodel$ jointly so as to converge to the optimal value functions $U\opt$ and $Q\opt$.

\subsection{Asymmetric Policy Iteration}\label{sec:api}

Consider \emph{Asymmetric Policy Iteration} (API), an iterative process analogous to Policy Iteration~\citep{sutton_reinforcement_2018} which employs both history-state and history values to implement asymmetry.  API starts from arbitrary initial values and policy $U_0$, $Q_0$, and $\policy_0$, and then uses the following update rules to generate sequences $U_k$, $Q_k$, and $\policy_k$,
%
\begin{align}
	%
	U_{k+1}       & \gets \lim_{n\to\infty} B_{\policy_k}^n U_k \,, & \text{(U-evaluation)} \label{eq:api:ustep} \\
	%
	Q_{k+1}       & \gets E U_{k+1} \,,                             & \text{(Q-evaluation)} \label{eq:api:qstep} \\
	%
	\policy_{k+1} & \gets g(Q_{k+1}) \,.                            & \text{(improvement)} \label{eq:api:pstep}
	%
\end{align}
%
The U-evaluation step can be practically implemented as the solution to the system of equations $U_{k+1} = R + \gamma P_{\policy_k} U_{k+1}$, or by repeatedly applying $B_{\policy_k}$ until convergence (see \Cref{algo:api}).
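
To make the two implementations of the U-evaluation step more concrete, the following sketch spells out the linear-algebra view.  It assumes that $U$ and $R$ are vectors indexed by a finite set of triples $(h, s, a)$ (e.g., histories truncated at a finite horizon), and that $P_{\policy_k}$ denotes the row-stochastic matrix with $(P_{\policy_k} U)(h, s, a) = \Exp_{s',o\mid s,a}\left[ U(hao, s', \policy_k(hao)) \right]$;  this reading of the notation is our assumption.  Because $\gamma < 1$, the matrix $I - \gamma P_{\policy_k}$ is invertible and
%
\begin{equation*}
	%
	U_{k+1} = \left( I - \gamma P_{\policy_k} \right)^{-1} R = \sum_{n=0}^{\infty} \gamma^n P_{\policy_k}^n R \,.
	%
\end{equation*}
%
\noindent Applying $B_{\policy_k}$ to the zero function $N$ times yields the first $N$ terms of this series, and under the same assumptions the truncation error is bounded by
%
\begin{equation*}
	%
	\left\lVert U_{k+1} - \sum_{n=0}^{N-1} \gamma^n P_{\policy_k}^n R \right\rVert_\infty \le \frac{\gamma^N}{1 - \gamma} \lVert R \rVert_\infty \,,
	%
\end{equation*}
%
\noindent which indicates how many applications of $B_{\policy_k}$ are needed for a given precision.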

\begin{theorem}[API Optimality]\label{thm:api:optimality}
	%
	The sequences $U_k$, $Q_k$, and $\policy_k$ generated by API converge to $U\opt$, $Q\opt$, and $\policy\opt$.
	%
\end{theorem}
%
\begin{proof}
	%
	By \Cref{thm:bpolicy:u}, $U_{k+1}$ equals the fixed point of $B_{\policy_k}$, i.e., $U_{k+1} = U^{\policy_k}$.
	Then, by \Cref{eq:q}, $Q_{k+1} = EU^{\policy_k} = Q^{\policy_k}$ and consequently $\policy_{k+1} = g(Q^{\policy_k})$.
	Therefore, in each iteration and until $\policy\opt$ is reached, the next policy $\policy_{k+1}$ is a strict improvement on the previous policy $\policy_k$ (Policy Improvement Theorem; \citealp{sutton_reinforcement_2018}).
	Let $k\opt$ be the smallest index such that $\policy_{k\opt}$ is optimal;
	for $k > k\opt$, we conclude that $U_k = U\opt$ and $Q_k = Q\opt$.
	%
\end{proof}

% The U-evaluation step can be practically implemented as the solution to the system of equations $U_{k+1} = R + \gamma P_{\pi_k} U_{k+1}$, or by using $B_{\policy_k}$ until convert by mapping  $U_k$ through $B_{\policy_k}$ until convergence (see \Cref{algo:api}).

\paragraph{Limitations}
%
While API is formally guaranteed to converge optimally, it also has significant practical limitations:
%
\begin{enumerate*}[label=(\alph*)]
	%
	\item API is a solution method which requires a model of the environment, as well as efficient and accurate methods to compute the expectations in the U-evaluation and the Q-evaluation steps.
	      %
	\item A practical approximation of the limit operator in the U-evaluation step (see \Cref{algo:api}) might itself require multiple iterations to achieve an adequate precision.
	      %
	\item API requires tabular models $U$ and $Q$, which is not only impractical given that the space of histories grows exponentially with episode length, but also makes it inapplicable to control problems which have continuous observations or states.
	      %
	\item Perhaps most importantly, API does not offer any significant advantage compared to its non-asymmetric counterpart PI.  Ultimately, both API and PI converge to the same optimal value function $Q\opt$;  if anything, API requires more memory and computation to achieve the same goal, resulting in a less practical solution method.
	      %
\end{enumerate*}

\paragraph{Why API?}
%
In light of the above limitations, particularly the last one, what is then the purpose of API?  We argue that API plays two crucial roles:
%
\begin{enumerate*}[label=(\alph*)]
	%
	\item The first is to show that privileged and asymmetric information such as the system state \emph{can} be properly included into a value-based solution process while maintaining formal optimality guarantees.  This theoretical aspect is often overlooked in modern asymmetric RL research, which instead tends to focus on heuristic methods and empirical results, and API represents the first theoretical guarantee of this kind for value-based RL.
	      %
	\item The second is to serve as a basis for other algorithms which do provide practical advantages compared to their non-asymmetric counterparts.  Starting from the next subsection, we relax various aspects of API and develop asymmetric value-based algorithms which address each of API's limitations.
	      %
\end{enumerate*}

\begin{algorithm}[t]
	%
	\caption{Asymmetric Policy Iteration (API)}\label{algo:api}
	%
	\begin{algorithmic}[1]
		%
		\Require{$U_0$, $Q_0$, $\policy_0$ arbitrarily initialized tabular models.}
		%
		\Ensure{$\lim_{k\to\infty}\{U_k,Q_k,\policy_k\} = \{U\opt,Q\opt,\policy\opt\}$.}
		%
		\For{$k\gets 0, 1, 2, 3, \ldots$}
		%
		\State{$U_{k+1} \gets U_k$}
		%
		\Repeat
		%
		\State{$U_{k+1} \gets B_{\policy_k} U_{k+1}$}
		%
		\Until{convergence}
		%
		\State{$Q_{k+1} \gets EU_{k+1}$}
		%
		\State{$\policy_{k+1} \gets g(Q_{k+1})$}
		%
		\EndFor
		%
	\end{algorithmic}
	%
\end{algorithm}

\subsection{Asymmetric Action-Value Iteration}\label{sec:aavi}

\begin{algorithm}[t]
	%
	\caption{Asymmetric Action-Value Iteration (AAVI)}\label{algo:aavi}
	%
	\begin{algorithmic}[1]
		%
		\Require{$U_0$, $Q_0$ arbitrarily initialized tabular models.}
		%
		\Ensure{$\lim_{k\to\infty} \{U_k,Q_k\} = \{U\opt,Q\opt\}$.}
		%
		\For{$k\gets 0, 1, 2, 3, \ldots$}
		%
		\State{$U_{k+1} \gets B_{g(Q_k)} U_k$}
		%
		\State{$Q_{k+1} \gets EU_{k+1}$}
		%
		\EndFor
		%
	\end{algorithmic}
	%
\end{algorithm}

The first limitation of API which we address is the presence of the limit operator in its U-evaluation step, which makes practical implementations approximate and/or inefficient.
%
To this end, consider \emph{Asymmetric Action-Value Iteration} (AAVI), an eager variant of API which uses the following updates,
%
\begin{align}
	%
	U_{k+1} & \gets B_{g(Q_k)} U_k \,, & \text{(U-evaluation)} \label{eq:aavi:ustep} \\
	%
	Q_{k+1} & \gets E U_{k+1} \,.      & \text{(Q-evaluation)} \label{eq:aavi:qstep}
	%
\end{align}
%
\noindent Compared to API, the improvement step has been folded into the U-evaluation step, removing the need for an explicit policy representation.  Further, the U-evaluation step has been simplified to apply operator $B_{g(Q_k)}$ a single time, making for a simpler, faster, and more practical implementation (see \Cref{algo:aavi}) without compromising optimality guarantees.  Both aspects make AAVI analogous to Value Iteration~\citep{sutton_reinforcement_2018}, with the primary differences being the use of action-values and asymmetry.

\begin{lemma}[Asymmetric Bellman Equivalence]\label{thm:asym-bellman-equivalence}
	%
	For mutually consistent $U$ and $Q$, the identity $E B_{g(Q)} U = BQ$ holds (proof in \Cref{sec:proof:thm:asym-bellman-equivalence}).
	%
\end{lemma}

\begin{theorem}[AAVI Optimality]\label{thm:aavi:optimality}
	%
	The sequences $U_k$ and $Q_k$ generated by AAVI converge to $U\opt$ and $Q\opt$.
	%
\end{theorem}
%
\begin{proof}
	%
	We can combine the U-evaluation and Q-evaluation steps, and then use \Cref{thm:asym-bellman-equivalence} to obtain
	$Q_{k+1} = EU_{k+1} = EB_{g(Q_k)} U_k = BQ_k$.
	By induction, $Q_k = B^k Q_0$, which converges to the fixed point of $B$: i.e., $\lim_{k\to\infty} Q_k = Q\opt$.
	This guarantees the existence of some iteration $k\opt$ such that $g(Q_k) = \policy\opt, \forall k\ge k\opt$.
	Therefore, $U_k = \Bpolicyopt^{k-k\opt} U_{k\opt}, \forall k \ge k\opt$, and $U_k$ converges to the fixed point of $\Bpolicyopt$: i.e., $\lim_{k\to\infty} U_k = U\opt$.
	%
\end{proof}
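
As a complementary sketch of the convergence rate implied by this argument, assume that $B$ is a $\gamma$-contraction in the sup-norm with $\gamma < 1$ (a standard property which we do not re-derive here).  For every $k \ge 1$, the Q-evaluation step ensures $Q_k = EU_k$, so that $Q_{k+1} = BQ_k$ holds and, together with $Q\opt = BQ\opt$,
%
\begin{align*}
	%
	\lVert Q_{k+1} - Q\opt \rVert_\infty & = \lVert BQ_k - BQ\opt \rVert_\infty \le \gamma \lVert Q_k - Q\opt \rVert_\infty \,, \\
	%
	\lVert Q_k - Q\opt \rVert_\infty     & \le \gamma^{k-1} \lVert Q_1 - Q\opt \rVert_\infty \,,
	%
\end{align*}
%
i.e., under this assumption the history values of AAVI approach $Q\opt$ at the same geometric rate as standard Value Iteration.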

\subsection{Asymmetric Q-Learning}\label{sec:aql}

\begin{algorithm}[t]
	%
	\caption{Asymmetric Q-Learning (AQL)}\label{algo:aql}
	%
	\begin{algorithmic}[1]
		%
		\Require{$U$, $Q$ mutually consistent tabular models.}
		%
		\Ensure{$\{U,Q\} \to \{U\opt,Q\opt\}$.}
		%
		\While{True}
		%
		\State{Initialize history and state $(h, s)$}
		%
		\While{$s$ is not terminal}
		%
		\State{Choose action $a$ from $\epsilon$-greedy policy on $Q$}
		%
		\State{Take action $a$, observe $r, s', o$}
		%
		\State{$y \gets r + \gamma U(hao, s', \argmax_{a'} Q(hao, a'))$}
		%
		\State\algparbox{$U(h, s, a) \gets (1-\alpha) U(h, s, a) + \alpha y$}
		%
		\State\algparbox{$Q(h, a) \gets Q(h, a) + \alpha \Pr(s \mid h) \left( y - Q(h, a) \right)$}
		%
		\State{$(s, h) \gets (s', hao)$}
		%
		\EndWhile
		%
		\EndWhile
		%
	\end{algorithmic}
	%
\end{algorithm}

Like all dynamic programming methods, API and AAVI make extensive and often unrealistic assumptions, such as access to a model of the environment and the ability to compute exact expectations.
To bypass many of these requirements, we can employ incremental stochastic updates based on sequentially sampled transitions.
We call this new method \emph{Asymmetric Q-Learning} (AQL), as it generalizes the iterative Q-Learning algorithm~\citep{watkins_learning_1989} to asymmetric PORL.

To handle the randomness induced by the sampled transitions, the algorithm must average over the samples using a variable stepsize parameter $\alpha_k \in [0,1]$.
At each iteration $k$, the agent samples a transition $(h_k, s_k, a_k, r_k, s_{k+1}, o_k)$ and conducts AAVI-like updates\footnote{
	The Q-evaluation step of AAVI (\Cref{eq:aavi:qstep}) can equivalently be expressed as $Q_{k+1} \gets EB_{g(Q_k)} U_k$, which is the form AQL employs to guarantee optimal convergence.
}
on the respective entries of $U_k$ and $Q_k$.

% \begin{align}
% %
% %
% U_{k+1} &\gets (1-\alpha_k) U_k + \alpha_k Y_k \,, \\
% %
% Q_{k+1} &\gets (1-\alpha_k) Q_k + \alpha_k Z_k \,.
% %
% \end{align}

For notational brevity, we first define the following targets:
\begin{align}
	Y_k(h, s, a) & \doteq \begin{cases}
		                      \left( B_{g(Q_k)} U_k + w_k \right) (h, s, a)
		                                   & \text{for $(h_k, s_k, a_k)$} \\
		                      U_k(h, s, a) & \text{otherwise}
	                      \end{cases} \,,                         \\
	Z_k(h, a)    & \doteq \begin{cases}
		                      \left( EB_{g(Q_k)} U_k + v_k \right) (h, a) & \text{for $(h_k, a_k)$} \\
		                      Q_k(h, a)                                   & \text{otherwise}
	                      \end{cases} \,.
\end{align}
Here, $w_k \in \uset$ and $v_k \in \qset$ are zero-mean noise processes that represent the randomness in the environment and action selection at iteration $k$.
AQL then conducts the following updates based on the stochastic targets $Y_k$ and $Z_k$:
\begin{align}
	\label{eq:aql_update_u}
	U_{k+1} & \gets U_k + \alpha_k (Y_k - U_k) \,,                  \\
	% 
	\label{eq:aql_update_q}
	Q_{k+1} & \gets Q_k + \alpha_k (Z_k - Q_k) \Pr(s_k\mid h_k) \,.
\end{align}
Note that the targets $Y_k$ and $Z_k$ are defined elementwise such that only one entry of each of $U_k$ and $Q_k$ (those associated with $(h_k, s_k, a_k)$ and $(h_k, a_k)$, respectively) is updated for any given index $k$.
When the stepsizes $\alpha_k$ are annealed towards zero at an appropriate rate, AQL converges optimally despite the noisy updates.
\begin{theorem}[AQL Optimality]\label{thm:aql:optimality}
	%
	Assume stepsizes $\alpha_k$ satisfying the following asymptotic conditions,
	%
	\begin{align}
		%
		\sum_{k=0}^\infty \alpha_k   & = \infty \,, &
		%
		\sum_{k=0}^\infty \alpha_k^2 & < \infty \,.
		%
	\end{align}
	%
	If $Q_0, U_0$ are mutually consistent ($Q_0 = E U_0$), then the sequences $Q_k$ and $U_k$ generated by AQL converge to $Q\opt$ and $U\opt$ with probability 1 (proof in \Cref{sec:aql_proof}).
	%
\end{theorem}
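As a concrete example (ours, for illustration only), the harmonic schedule $\alpha_k = 1 / (k+1)$ satisfies both conditions, since
%
\begin{align*}
	%
	\sum_{k=0}^\infty \frac{1}{k+1}     & = \infty \,, &
	%
	\sum_{k=0}^\infty \frac{1}{(k+1)^2} & = \frac{\pi^2}{6} < \infty \,.
	%
\end{align*}
%
More generally, the polynomial schedule $\alpha_k = (k+1)^{-p}$ satisfies both conditions if and only if $1/2 < p \le 1$, whereas any constant stepsize $\alpha_k = \alpha > 0$ violates the second condition.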
%
% \textit{Proof sketch.}
% The proof sketch follows similarly to that of AAVI, with some modifications to show that the sampling noise is negligible in the limit.
% Let $p_k \in [0,1]$ be the conditional probability of sampling $(h_k,s_k,a_k)$ at iteration $k$.
% Analogously to the proof of Proposition~1 of \cite{tsitsiklis2002convergence}, we refactor \Cref{eq:aql_update_u,eq:aql_update_q} to remove their conditional statements by defining new noise processes $w'_k  \in \uset$ and $v'_k \in \qset$ such that
% \begin{align*}
%     U_{k+1} &= (1-\alpha_k p_k) U_k + \alpha_k p_k (B_{g(Q_k)} U_k + w'_k)
%     ,\\
%     % 
%     Q_{k+1} &= (1-\alpha_k p_k) Q_k + \alpha_k p_k (E B_{g(Q_k)} U_k + v'_k)
%     .
% \end{align*}
% By design, $w'_k$ and $v'_k$ are zero mean and their conditional variances are bounded such that standard convergence results for contraction mappings apply (e.g., Proposition~4.4 of \cite{bertsekas}).
% It follows that $Q_k$ and $U_k$ converge to $Q^*$ and $U^*$ with probability 1.
% 
% \todo[inline]{alternative ending}

The factor $\Pr(s_k\mid h_k)$ is necessary to ensure that $U_k$ and $Q_k$ remain mutually consistent throughout the process, a necessary condition for optimal convergence.  $\Pr(s_k\mid h_k)$ can be interpreted as a scaling factor which makes $U_k$ and $Q_k$ update at comparable rates.  For any given ``full'' update on $U_k(h_k, s_k, a_k)$, the corresponding update on $Q_k(h_k, a_k)$ should be scaled down to a ``partial amount'' relative to the likelihood of $s_k$.
%
While we were able to remove other model-based requirements, $\Pr(s_k\mid h_k)$ remains, leaving AQL just shy of achieving both optimal convergence and concrete practicality at the same time.  While it may be possible to approximate this factor in model-free ways, AQL remains primarily a conceptual algorithm, in part because it also requires a tabular model over histories.  Either way, AQL serves as a fundamental basis to derive the next iteration of asymmetric value-based RL.
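
The scaling role of this factor can be sketched as follows, assuming that the operator $E$ marginalizes the state given the history, i.e., $(EU)(h, a) = \sum_s \Pr(s\mid h) U(h, s, a)$ (our reading of \Cref{eq:q}; the symbol $\delta_k$ is introduced only for this illustration).  Suppose that $Q_k = EU_k$, and that the update on $U_k$ changes the single entry $U_k(h_k, s_k, a_k)$ by an amount $\alpha_k \delta_k$.  Then,
%
\begin{align*}
	%
	(EU_{k+1})(h_k, a_k) & = \sum_s \Pr(s\mid h_k) U_{k+1}(h_k, s, a_k)              \\
	%
	                     & = Q_k(h_k, a_k) + \alpha_k \Pr(s_k\mid h_k) \delta_k \,,
	%
\end{align*}
%
so the entry $Q_k(h_k, a_k)$ must move by the same increment scaled by $\Pr(s_k\mid h_k)$ for the identity $Q_{k+1} = EU_{k+1}$ to continue to hold.  This only illustrates the scaling; the formal argument is the one referenced in \Cref{thm:aql:optimality}.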

\subsection{Asymmetric DQN}\label{sec:adqn}

\begin{algorithm}[t]
	%
	\caption{Asymmetric DQN (ADQN)}
	%
	\label{algo:adqn}
	%
	\begin{algorithmic}[1]
		%
		\Require{$\umodel, \qmodel$ deep models parameterized by $\theta$.}
		%
		\State Initialize parameters $\theta$
		%
		\State Initialize and prepopulate episode buffer
		%
		\While{True}
		%
		\State \algparbox{From the simulated environment, sample episodes and append them to the episode buffer}
		%
		\State \algparbox{From the episode buffer, sample batch of transitions $\{(h_i, s_i, a_i, r_i, s'_i, o_i)\}_{i=1}^N$}
		%
		\State $L_U \gets \frac{1}{N} \sum_{i=1}^N \uloss(h_i, s_i, a_i, r_i, s'_i, o_i)$
		%
		\State $L_Q \gets \frac{1}{N} \sum_{i=1}^N \qloss(h_i, s_i, a_i, r_i, s'_i, o_i)$
		%
		\State Perform a gradient step on $\theta$ using $\nabla_\theta ( L_U + L_Q )$
		%
		\EndWhile
		%
	\end{algorithmic}
	%
\end{algorithm}

When acting in POMDPs with high-dimensional observations and states, a tabular-lookup method such as AQL becomes infeasible.
In such instances, we must introduce function approximation to generalize over similar experiences.
The use of approximation sacrifices the optimal convergence guarantee established by \Cref{thm:aql:optimality}, but is necessary to scale algorithms to significantly more challenging partially observable environments.
Nevertheless, the value function approximations are an orthogonal matter to how privileged state information is used, and we expect the sound theoretical principles upon which AQL is built will help asymmetric deep methods even when relying on function approximation.

Our primary algorithmic contribution here is \emph{Asymmetric DQN} (ADQN), an asymmetric variant of DQN derived by introducing value function approximation to AQL.
We first replace the tabular-lookup models $U$ and $Q$ of AQL with parametric differentiable models $\umodel$ and $\qmodel$.
In practice, these are implemented as deep neural networks whose architectures are chosen according to the structure of the states and observations emitted by the POMDP.

To facilitate the substitution, we must reformulate the stochastic update rules of AQL as squared-error loss minimization.
For the rest of the section, due to spacing concerns, we will use $\pmodel = g(\qmodel)$ as a shorthand to represent actions selected greedily on $\qmodel$, i.e., $\pmodel(h) = \argmax_a \qmodel(h, a)$.
Given a single environment interaction $(h, s, a, r, s', o)$, the corresponding losses can be defined as
\begin{align}
	%
	\uloss & = \left( r + \gamma \Stop\left[ \umodel(hao, s', \pmodel(hao)) \right] - \umodel(h, s, a) \right)^2 \,, \label{eq:loss:u} \\
	%
	\qloss & = \left( r + \gamma \Stop\left[ \umodel(hao, s', \pmodel(hao)) \right] - \qmodel(h, a) \right)^2 \,, \label{eq:loss:q}
	%
\end{align}
\noindent
where $\Stop$ is the \emph{stop-gradient} function which indicates that gradient calculation should not consider the enclosed terms.
It is worth noting that $\uloss$ and $\qloss$ use the same target to train $\umodel$ and $\qmodel$.
The crucial difference is that $\umodel$ is able to model the target as a function of $s$, while $\qmodel$ is unable to do so, and can only model the expectation of the target over values of $s$.
In a way, these losses approximately enforce a ``loose'' form of mutual consistency $\qmodel \approx E\umodel$.
In practice, the target term is generated by ``target networks'' that rely on stale parameters to stabilize learning when bootstrapping~\citep{mnih_human-level_2015};
the stale parameters are periodically updated by copying the main parameters.
The total loss ${\uloss + \qloss}$ can be jointly minimized with respect to the parameters by a single backpropagation step, efficiently approximating the interleaved updates of AQL.
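
To see why a shared target induces this loose form of consistency, consider the following sketch, where $y$ denotes the shared stop-gradient target of \Cref{eq:loss:u,eq:loss:q} (a shorthand introduced here), and where we make the idealized assumption that both models are expressive enough to minimize their expected losses pointwise.  Since the minimizer of an expected squared error is the conditional mean of the target,
%
\begin{align*}
	%
	\umodel(h, s, a) & = \mathbb{E}\left[ y \mid h, s, a \right] \,, \\
	%
	\qmodel(h, a)    & = \mathbb{E}\left[ y \mid h, a \right]
	= \mathbb{E}_{s\mid h, a}\left[ \mathbb{E}\left[ y \mid h, s, a \right] \right]
	= \mathbb{E}_{s\mid h, a}\left[ \umodel(h, s, a) \right] \,,
	%
\end{align*}
%
which matches $\qmodel = E\umodel$ whenever the state distribution implied by the training data coincides with the one used by $E$.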

When the function approximators are nonlinear (as is often the case for neural networks), training can become unstable or diverge if the gradient updates are conducted on sequentially collected, and therefore correlated, transitions~\citep{mnih_human-level_2015}.
The second critical modification to AQL is therefore the adoption of experience replay~\citep{lin_self-improving_1992} in order to decorrelate training experiences.
Rather than training on a sample immediately when it is collected, each POMDP transition $(h, s, a, r, s', o)$ is deferred to a first-in first-out replay memory.
Periodically, when it is time to train the networks, a minibatch of several experiences is sampled from the replay memory;
gradients for these samples are computed and averaged together to estimate the true gradient of the joint loss $\smash{\uloss + \qloss}$, which in turn is used to improve the parameters.
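
Concretely, writing $y_i$ for the stop-gradient target of the $i$-th sampled transition (notation introduced here for illustration), the chain rule applied to \Cref{eq:loss:u,eq:loss:q} gives the minibatch gradient estimate
%
\begin{align*}
	%
	\nabla_\theta \left( L_U + L_Q \right) = -\frac{2}{N} \sum_{i=1}^N \Big[
	 & \left( y_i - \umodel(h_i, s_i, a_i) \right) \nabla_\theta \umodel(h_i, s_i, a_i) \\
	 & + \left( y_i - \qmodel(h_i, a_i) \right) \nabla_\theta \qmodel(h_i, a_i) \Big] \,,
	%
\end{align*}
%
where no gradient flows through $y_i$.  This is a sketch of the semi-gradient form implied by the losses, and not a description of any specific implementation.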

\paragraph{Why ADQN?}
%
Having finally addressed the practical disadvantages of API, it is worthwhile to revisit the ``why'' question, this time focusing on why one would prefer ADQN over DQN.  Ultimately, the purpose of both algorithms is to train an approximate $\qmodel \approx Q\opt$ through which optimal control can be executed, and both algorithms should \emph{in theory} converge to very similar approximations.  What is then the advantage of ADQN over DQN?  Similarly to the asymmetric actor-critic case~\citep{baisero_unbiased_2022}, the advantage is a practical one associated with the difficulties of learning an appropriate representation of history $\phi(h)$, which is one of the major bottlenecks in PORL.  History representations are sequence models which notoriously require large amounts of data and processing power for proper training.  To compound this issue, the quality of the data used to train the history representation in PORL is directly related to the quality of $\qmodel(\phi(h), \cdot)$, which in turn depends on the quality of the history representation itself;  unsurprisingly, it can be quite hard to bootstrap the training of an improved history representation when starting from a poor one.  Note, however, that learning an appropriate state representation $\phi(s)$ is much simpler than learning $\phi(h)$ due to the non-sequential nature of individual states, i.e., the $\phi(s)$ representation model has fixed input and output sizes, and can generally be modeled using a simpler feed-forward architecture.  In ADQN, the issues associated with learning a proper history representation are alleviated by the fact that its training is bootstrapped not only on the history representation itself, but also on the state representation.
Even when the history representation is poor, we can expect the state representation to contain sufficient contextual information to allow $\umodel(\phi(h), \phi(s), \cdot)$ to model meaningful values, which in turn helps further bootstrap the learning of the history representation $\phi(h)$, the history model $\qmodel(\phi(h), \cdot)$, and the respective implicit policy $g(\qmodel)$.

Next, we consider some variants of interest of ADQN.

\subsubsection{Variance-Reduced ADQN}\label{sec:adqn:variance-reduced}

In this variant, we approximate the target of $\qloss$ from \Cref{eq:loss:q} as $r + \gamma \umodel(hao, s', \pmodel(hao)) \approx \umodel(h, s, a)$, which holds particularly well once $\umodel$ has been trained sufficiently.  Therefore, this variant uses the following losses,
%
\begin{align}
	%
	\uloss & = \left( r + \gamma \Stop\left[ \umodel(hao, s', \pmodel(hao)) \right] - \umodel(h, s, a) \right)^2 \,, \label{eq:loss:u:variance-reduced} \\
	%
	\qloss & = \left( \Stop\left[ \umodel(h, s, a) \right] - \qmodel(h, a) \right)^2 \,. \label{eq:loss:q:variance-reduced}
	%
\end{align}
%
This approximate target results in lower variance throughout the entire training process at the cost of introducing bias primarily in the early stages of training, a trade-off which may prove advantageous in some control problems.
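
The variance reduction can be sketched via the law of total variance.  Writing $y = r + \gamma \umodel(hao, s', \pmodel(hao))$ for the original target (a shorthand introduced here), and conditioning on the inputs $(h, a)$ of $\qmodel$,
%
\begin{align*}
	%
	\operatorname{Var}\left[ y \mid h, a \right]
	= \mathbb{E}\left[ \operatorname{Var}\left[ y \mid h, s, a \right] \mid h, a \right]
	+ \operatorname{Var}\left[ \mathbb{E}\left[ y \mid h, s, a \right] \mid h, a \right] \,.
	%
\end{align*}
%
Under the idealized assumption that $\umodel(h, s, a) = \mathbb{E}\left[ y \mid h, s, a \right]$, the variance-reduced target retains only the second term, and discards the first term which stems from the randomness in $(r, s', o)$.  When the assumption fails (e.g., early in training), the discrepancy between $\umodel(h, s, a)$ and $\mathbb{E}\left[ y \mid h, s, a \right]$ appears as bias in the target of $\qmodel$.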

\subsubsection{State-Only ADQN}\label{sec:adqn:state}

Some prior work in asymmetric RL has adopted heuristic forms of asymmetry which use state-only (i.e., history-less) value functions $U(s, a)$.  Such a form of asymmetry is, however, associated with fundamental theoretical issues which may severely compromise the learning performance, ranging from potentially being ill-defined, to introducing bias into the learning process~\citep{baisero_unbiased_2022}.  Such issues are inherently related to partial observability, and their effects scale with the amount of partial observability that the agent is subject to, as well as the agent's ``reactiveness'', i.e., the amount of history that is willfully ignored by the agent itself to select actions.
%
Nonetheless, this state-only form of value function may be more useful than others in control problems which have small amounts of partial observability, such as vision-based tasks with an occlusion-free view of the environment~\citep{pinto_asymmetric_2018,baisero_unbiased_2022}.
%
Although our main focus is control problems with significant amounts of partial observability, we are still interested in formulating a state-only variant of ADQN as an additional baseline, and as reference for future work which may focus on the kinds of control problems where it thrives.

In this state-only variant, we redefine the parametric model $\umodel(s, a)$ to ignore history, and adopt the following losses,
%
\begin{align}
	%
	\uloss & = \left( r + \gamma\Stop\left[ \umodel(s', \pmodel(hao)) \right] - \umodel(s, a) \right)^2 \,, \label{eq:loss:u:state} \\
	%
	\qloss & = \left( r + \gamma\Stop\left[ \umodel(s', \pmodel(hao)) \right] - \qmodel(h, a) \right)^2 \,. \label{eq:loss:q:state}
	%
\end{align}

\subsubsection{Variance-Reduced State-Only ADQN}\label{sec:adqn:state:variance-reduced}

This variant applies the variance-reduction approximation from \Cref{sec:adqn:variance-reduced} to state-only ADQN.  In this case, we approximate the target of $\qloss$ from \Cref{eq:loss:q:state} as $r + \gamma \umodel(s', \pmodel(hao)) \approx \umodel(s, a)$, resulting in the following losses,
%
\begin{align}
	%
	\uloss & = \left( r + \gamma \Stop\left[ \umodel(s', \pmodel(hao)) \right] - \umodel(s, a) \right)^2 \,, \\
	%
	\qloss & = \left( \Stop\left[ \umodel(s, a) \right] - \qmodel(h, a) \right)^2 \,.
	%
\end{align}

\section{Evaluation}

\begin{figure*}[t!]
	%
	\centering
	%
	\begin{subfigure}{.75\linewidth}
		\centering
		\includegraphics[width=\linewidth]{images/performance.legend.pdf}
	\end{subfigure}

	\begin{minipage}{.63\linewidth}

		\begin{subfigure}{.49\linewidth}
			\centering
			\includegraphics[width=\linewidth]{images/performance.POMDP-heavenhell_3-episodic-v0.pdf}
			\caption{\heavenhellthree}\label{fig:performance:heavenhellthree}
		\end{subfigure}
		%
		\begin{subfigure}{.49\linewidth}
			\centering
			\includegraphics[width=\linewidth]{images/performance.POMDP-heavenhell_4-episodic-v0.pdf}
			\caption{\heavenhellfour}\label{fig:performance:heavenhellfour}
		\end{subfigure}

		\begin{subfigure}{.49\linewidth}
			\centering
			\includegraphics[width=\linewidth]{images/performance.extra-car-flag-v0.pdf}
			\caption{\carflag}\label{fig:performance:carflag}
		\end{subfigure}
		%
		\begin{subfigure}{.49\linewidth}
			\centering
			\includegraphics[width=\linewidth]{images/performance.extra-cleaner-v0.pdf}
			\caption{\cleaner}\label{fig:performance:cleaner}
		\end{subfigure}

	\end{minipage}
	\begin{minipage}{.33\linewidth}

		\begin{subfigure}{\linewidth}
			\centering
			\includegraphics[width=\linewidth]{images/performance.gv_memory_four_rooms.7x7.yaml.pdf}
			\caption{\gvmemoryfourrooms}\label{fig:performance:gvmemoryfourrooms}
		\end{subfigure}

	\end{minipage}
	%
	\caption{Performance curves showing episodic returns averaged over the last 100 completed episodes, with statistics computed over $20$ independent runs.  The shaded areas represent one standard error around the mean.}\label{fig:performance}
	%
\end{figure*}

We perform an empirical evaluation of our proposed ADQN method and its variants in a variety of environments which feature significant amounts of partial observability.

\paragraph{Methods} We compare the performances of $5$ value-based PORL algorithms, denoted as follows:
%
\begin{itemize}
	%
	\item \dqn\ is the standard non-asymmetric DQN algorithm;
	      %
	\item \adqn\ and \adqnvr\ are the history-state ADQN algorithms from \Cref{sec:adqn,sec:adqn:variance-reduced};  and
	      %
	\item \adqnstate\ and \adqnstatevr\ are the state-only ADQN algorithms from \Cref{sec:adqn:state,sec:adqn:state:variance-reduced}.
	      %
\end{itemize}

\paragraph{Environments} Evaluations are run on $5$ partially observable navigation tasks which require information gathering strategies and memorization of the past:
%
\begin{itemize}
	%
	\item \heavenhellthree\ and \heavenhellfour~\citep{bonet_solving_1998}, corridor environments where the agent must reach the exit to \emph{heaven} and avoid the exit to \emph{hell}, but must first backtrack to visit a \emph{priest} to learn which exit is which;
	      %
	\item \carflag~\citep{nguyen_pomdp_2021}, a 1-dimensional continuous control variant of \heavenhell;
	      %
	\item \cleaner~\citep{jiang_multi-agent_2021}, a maze environment where two agents must reach all tiles to clean them.  In our experiments, the two agents are treated as a single agent, and controlled in a centralized fashion; and
	      %
	\item \gvmemoryfourrooms~\citep{baisero_gym-gridverse_2021}, a dynamically generated gridworld with $4$ connected rooms, where the agent must reach the \emph{good} exit and avoid the \emph{bad} exit, but must first find and memorize a \emph{beacon} to learn which is which.
	      %
\end{itemize}
%
\noindent A more thorough description of these environments can be found in Appendix~C of \cite{baisero_unbiased_2022}.

Each method is trained and evaluated using code available as a public repository\footnote{\url{https://github.com/abaisero/asym-rlpo/}}. For each environment and algorithm, we perform an independent grid-search over some hyper-parameters of interest (see \Cref{sec:hpsearch}), and select the combination of hyper-parameters which results in the best final performance and learning stability (prioritizing final performance if necessary).  To improve the statistical significance of the results, each combination of environment, algorithm, and hyper-parameters is run $20$ independent times.
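
For reference, and assuming the usual estimator (the plotting code may differ in its details), the standard error of the mean over $n = 20$ independent runs is
%
\begin{align*}
	%
	\frac{\hat\sigma}{\sqrt{n}} = \frac{\hat\sigma}{\sqrt{20}} \approx 0.224\, \hat\sigma \,,
	%
\end{align*}
%
where $\hat\sigma$ denotes the sample standard deviation of the returns across runs (a symbol introduced here for illustration).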

\subsection{Results and Discussion}

\Cref{fig:performance} shows the results of these evaluations, which broadly confirm our theoretical analysis of asymmetric value-based PORL, the practical advantage of employing history-state forms of evaluation to aid partially observable control, and the superiority of ADQN compared to other similar symmetric and asymmetric variants.  This evaluation further confirms other recently developed theoretical analyses of asymmetric PORL, i.e., that state-only forms of asymmetry are inadequate to handle non-trivial amounts of partial observability~\citep{baisero_unbiased_2022}.

% both prior work on the important of employing correct forms of asymmetry for partially observable control, the theoretical analysis of our work on asymmetric value-based PORL, and the practical advantage of of ADQN of employing history-state forms of evaluation to aid partially observable control.

% and prior work concerning the importance of employing correct forms of asymmetry for partially observable control.

Across the board, \adqn\ and \adqnvr\ outperform all baselines in final performance, convergence speed, and/or overall learning stability.  The contrast between methods is particularly stark in \Cref{fig:performance:heavenhellthree,fig:performance:heavenhellfour}, where \adqn\ and \adqnvr\ are not just the only methods that demonstrate any substantial improvement, but are also able to reach optimal performance.
%
On the other hand, the state-only variants fail to outperform even the \dqn\ baseline in most environments (with a single exception discussed later), which further confirms the theoretical issues that have been recently associated with state-only forms of asymmetry.
%
Broadly, the variance-reduced variants \adqnvr\ and \adqnstatevr\ only differ in relatively minor ways from their default counterparts.  Such differences can be found in \Cref{fig:performance:heavenhellthree,fig:performance:cleaner}, where \adqn\ is more stable or has better convergence properties than \adqnvr, and in \Cref{fig:performance:carflag,fig:performance:cleaner}, where \adqnstatevr\ has better final convergence values than \adqnstate.
%
This seems to indicate that the type of asymmetry (history-state or state-only) is a larger contributor to overall performance than the choice of using the standard or the variance-reduced variant of the same method.
%
In \Cref{fig:performance:gvmemoryfourrooms} too, \adqn\ and \adqnvr\ outperform all other baselines.  However, these results also represent an interesting exception to some of the above analysis, e.g., \adqnstate\ outperforms \dqn, and \adqnvr\ outperforms \adqn.  To explain this, we note that this is the only task not reliably solved by any of the methods, which is likely due to the highly dynamic nature of the randomly generated map and object locations.  Indeed, the absence here of some of the trends found in the other results may be explained by the fact that none of the methods achieve their full potential performance.

\section{Conclusions}

OTOE is an RL framework where agents are trained offline in a simulated environment, which allows temporary access to privileged information which would otherwise be unavailable, like the partially observable environment's state.  Asymmetry is a common mechanism through which such privileged information can be used during training, and has the potential to greatly boost learning performance and efficiency when implemented correctly.  However, modern work in asymmetric RL tends to focus on unproven heuristics which lack a theoretical justification.  In this work, we filled this void and developed the theory of asymmetric value-based RL.  We achieved our primary goal of developing a theoretically sound asymmetric value-based RL algorithm by employing a bottom-up approach, and by first focusing on the base theory of asymmetric policy improvement.  This took the form of API, a conceptual solution method with strict optimal convergence guarantees but concrete practical limitations.  Then, we applied a series of relaxations to API which addressed those limitations and ultimately resulted in ADQN, a practical and competitive deep RL algorithm.  We performed an empirical evaluation to compare the performances of ADQN and its variants to standard non-asymmetric DQN in a series of environments which are specifically selected to exhibit high levels of partial observability, and which require information-gathering strategies and memorization of the past.  In all these environments, ADQN achieved the best performance, even solving control problems which standard DQN could not.  Overall, our evaluation confirmed the potential offered by privileged information, the importance of using it in principled and theoretically-guided ways, and the overall success of our ADQN algorithm in partially observable control problems.
%
Future work may focus on extending ADQN to the multi-agent control case, which poses further learning challenges, on finding applications where state-only ADQN may thrive (such as vision-based robotic tasks with little partial observability), and on extending the evaluation of ADQN to more complicated partially observable vision-based tasks.

\begin{contributions} % will be removed in pdf for initial submission,
	% so you can already fill it to test with the
	% ‘accepted’ class option
	%
	Andrea~Baisero conceived the idea, developed proofs, ran experiments, and wrote the paper.
	Brett~Daley developed proofs and wrote the paper.
	Christopher~Amato supervised.
	%
\end{contributions}

\begin{acknowledgements} % will be removed in pdf for initial submission,
	% so you can already fill it to test with the
	% ‘accepted’ class option
	%
	This research was funded by NSF award 1816382.
	%
\end{acknowledgements}

\bibliography{baisero_636}

\end{document}
