% \documentclass{uai2022} % for initial submission
\documentclass[accepted]{uai2022} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2022} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2022} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)
% Added by Fan
% \usepackage{hyperref}
% \hypersetup{
%     colorlinks=true,
%     linkcolor=blue,
%     filecolor=blue,      
%     urlcolor=blue,
%     citecolor=cyan,
% }
\usepackage{xcolor}
\usepackage{amsmath,amssymb,amsthm,mathrsfs,url,array}
\usepackage{graphicx}
\usepackage{subfigure}
\usepackage{wrapfig}
\usepackage{appendix}
\usepackage{multirow}
\usepackage{makecell}
\usepackage{diagbox}
% \usepackage{algorithm}
\usepackage[ruled,linesnumbered]{algorithm2e}
% \usepackage{algorithmic}
% \theoremstyle{break}
\newtheorem{Def}{Definition} 
\newtheorem{Th}{Theorem}
\newtheorem{Co}{Corollary}
\newtheorem{Lm}{Lemma}
\newtheorem{Prop}{Proposition} 
\allowdisplaybreaks[4]

\newcommand{\xyx}[1]{\textcolor{red}{[XYX]: #1}}
\newcommand{\fan}[1]{\textcolor{blue}{[Fan]: #1}}
\newcommand{\org}[1]{\textcolor{green}{[ORIGINAL]: #1}}
\newcommand{\jzl}[1]{\textcolor{brown}{[JZL]: #1}}
%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{X-MEN: Guaranteed XOR-Maximum Entropy \\
Constrained Inverse Reinforcement Learning}

% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is authomatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<ding274@purdue.edu>?Subject=Your UAI 2022 paper}{Fan Ding}{}}
\author[1]{Yexiang Xue}
% Add affiliations after the authors
\affil[1]{%
    Department of Computer Science\\
    Purdue University\\
    West Lafayette, Indiana, USA.
}
  
\begin{document}
\maketitle

%\input{tex/abstract}
\begin{abstract}
Inverse Reinforcement Learning (IRL) is a powerful way of learning from demonstrations. 
%
In this paper, we address IRL problems in which prior knowledge guarantees that optimal policies never violate certain constraints. 
%
Conventional approaches ignoring these constraints need many demonstrations to converge. 
%
We propose XOR-Maximum Entropy Constrained Inverse Reinforcement Learning (X-MEN), which is guaranteed to converge to the globally optimal reward function at a linear rate w.r.t. the number of learning iterations. 
%
X-MEN embeds XOR-sampling -- a provable sampling approach that transforms the \#P-complete sampling problem into queries to NP oracles -- into the framework of maximum entropy IRL. 
%
%XOR-sampling  and has a constant approximation guarantee on the probabilities of the samples obtained, 
%
X-MEN also guarantees the learned IRL agent will never generate trajectories that violate constraints. 
%
Empirical results in navigation demonstrate that X-MEN converges faster to the optimal rewards compared to baseline approaches and always generates trajectories that satisfy multi-state combinatorial  constraints. 
%Inverse Reinforcement Learning provides an efficient tool for generalizing the demonstration behavior, based on the assumption that the expert is optimally acting in a Markov Decision Process. 
%
%However, it is often the case that such behavior is more succinctly represented by a simple reward combined with both global and local hard constraints, where the agent have to maximize cumulative rewards subject to these given constraints on their behavior.
%
%Previous approaches either cannot handle these constraints or can only deal with those local ones.
%
%We present X-MEN, a novel approach based on the Maximum Entropy IRL framework and the success of XOR-Sampling, to perform constrained reasoning about the likelihood of the expert's demonstrations given our knowledge of an MDP's dynamics.
%
%Empirical results on both simulated behavior on gridworld
%and recorded data of humans navigating around an obstacle show the efficacy of our approach to recover the reward function with both state-action specific or trajectory-long hard constraints.
\end{abstract}

%\input{tex/intro}
\section{INTRODUCTION}
\begin{figure*}[t]
\subfigure[Add no constraint]{\label{fig:intuition1}
\includegraphics[width=0.32\linewidth]{figs/intuition1.pdf}}
\subfigure[Add single-state constraints $\mathcal{C}_1$]{\label{fig:intuition2}
\includegraphics[width=0.32\linewidth]{figs/intuition2.pdf}}
% \hspace{-0.2in}
\subfigure[Add multi-state  constraint $\mathcal{C}_2$]{\label{fig:intuition3}
\includegraphics[width=0.32\linewidth]{figs/intuition3.pdf}}
% \hspace{-0.2in}
% \vspace*{-0.2cm}
\caption{Examples of constrained IRL problems. The agent wants to move from the start state $S_0$ (blue grid cell) to the goal state $S_G$ (green grid cell). The ground-truth demonstration is shown as the red line. The same initial reward function before learning is used for all three situations, with the one-step reward listed in each grid cell. The most likely trajectories under the initial reward function (i.e., those maximizing rewards subject to the constraints) are shown using blue dashed arrows. (\textbf{a}) When no constraint is added to the MDP, the agent finds the shortest path directly upward from $S_0$ to $S_G$. (\textbf{b}) When the single-state constraint $\mathcal{C}_1$, which forbids the agent from passing through the red grid cells, is imposed, the agent can detour on either the left or the right side. (\textbf{c}) When an additional multi-state constraint $\mathcal{C}_2$ is imposed, which requires at least half of the visited states to lie in the shaded area, the optimal trajectory is to detour on the right side. Notice that this behavior aligns with the demonstration.} 
\label{fig:intuition}
\end{figure*}

Inverse Reinforcement Learning (IRL) \citep{ng2000algorithms,abbeel2004apprenticeship,ziebart2008maximum,arora2021survey,li2017deep}
provides an important way to learn from demonstrations. 
%
IRL assumes that the demonstrator implicitly maximizes the cumulative reward of a Markov Decision Process (MDP). 
%
The goal of IRL is to recover the unknown reward function from the observed demonstrations. 
%provides an avenue for addressing this challenge by generally
%reducing the problem of recovering a demonstrated behavior to the recovery of a reward function that induces the observed behavior, 
%
Various IRL algorithms have been proposed, including Linear IRL \citep{ng2000algorithms,abbeel2004apprenticeship} and Large-Margin Q-Learning \citep{ratliff2006maximum}. 
%
To differentiate among multiple reward functions which lead to similar behaviors, Maximum Entropy Inverse Reinforcement Learning (MaxEnt IRL) \citep{ziebart2008maximum,wulfmeier2015maximum,finn2016guided,ho2016generative} assumes that the demonstrator samples trajectories from a maximum entropy distribution parameterized by the cumulative reward. 
%, a very influential approach which addresses the inherent ambiguity of . This approach has been extended with deep networks as function approximators for the reward function \cite{} and is the basis for several more recent algorithms that lift some of its assumptions \cite{}. 

In this paper, we focus on IRL problems where certain constraints are known beforehand and hence do not need to be rediscovered by the learning algorithm. 
%
The trajectories from the demonstrator are known to satisfy these constraints and we require the IRL agent to satisfy these constraints as well. 
%
Indeed, standard IRL algorithms \citep{abbeel2007application,vasquez2014inverse,scobee2018haptic} can be applied to this scenario without modifications and they eventually discover the optimal reward function, which generates trajectories satisfying all constraints. 
%
Nevertheless, learning these constraints may require a large number of demonstrations. 
%
Worse, the IRL agent may still produce trajectories that occasionally violate constraints even after many training epochs.
%
This is especially problematic in safety-critical domains such as autonomous driving and robotic surgery. 

%Certainly, IRL Ignoring these constraints ...
%While these types of IRL algorithms have proven useful in a variety of situations , their basis in assuming that reward functions fully represent task specifications makes them ill-suited to problem domains with hard constraints or non-Markovian objectives. Such constraints arise in safety-critical systems, where requirements such as an autonomous vehicle avoiding collisions with pedestrians are more naturally expressed as hard constraints than as soft reward penalties.

Recent work has attempted to embed constraints into IRL. For example, \citet{vazquez2017learning,kalweit2020deep} use demonstrations to learn a rich class of possible specifications that can represent a task. Others have focused specifically on learning constraints, that is, behaviors that are expressly forbidden or infeasible \citep{chou2018learning,subramani2018inferring,mcpherson2018modeling,scobee2019maximum,anwar2020inverse,mcpherson2021maximum}. 
%
Nevertheless, these attempts have so far focused on \textit{single-state} constraints, where a handful of actions are forbidden in certain states and these forbidden actions have little impact on future state-action transitions. Such approaches cannot address \textit{multi-state combinatorial} constraints, which restrict a chain of actions spanning multiple time steps. 
%
For example, Figure~\ref{fig:intuition} (c) shows a navigation task in which the constraint requires at least half of the states in each trajectory to lie in the shaded area. 
%
With this constraint imposed, only trajectories passing through the right-hand side are feasible. 
%
Such constraints cannot be addressed by previous approaches, which mask out actions from certain states. 
 
%However, these methods only consider local constraints on states, actions and features in a
%Markov Decision Process (MDP), which are less than enough to represent most real-world scenarios as most constraints are trajectory-long. For instance, an agent has to know that certain paths can never lead to an exit in a labyrinth. It is towards the problem of inferring such hard combinatorial constraints that we turn our attention.

In this work, we propose \textbf{X}OR-\textbf{M}aximum \textbf{EN}tropy (X-MEN) Constrained Inverse Reinforcement Learning, which \textbf{\textit{provably converges to the optimal reward function for MaxEnt IRL in a linear number of training steps}}, even in the presence of hard combinatorial constraints. 
%
X-MEN is also guaranteed to produce trajectories that satisfy multi-state combinatorial constraints. 
%
X-MEN builds on Maximum Entropy IRL \citep{ziebart2008maximum,boularias2011relative}. 
%
Distinctively, X-MEN harnesses XOR-sampling to estimate the  gradient of the expected reward from the current model distribution. 
%
The recently proposed XOR-Sampling \citep{Gomes2006NearUniformSampling,Ermon13Wish,ermon2013embed} reduces the sampling problem to queries to NP oracles via hashing and projection, and guarantees a constant-factor approximation for the expectation estimation. 
%
%After obtaining samples, 
To maximize the likelihood of the demonstrated behavior, X-MEN uses Stochastic Gradient Descent (SGD) to maximize the difference between expected reward from the demonstration and that from the trajectories sampled from the current model distribution, a procedure closely resembling contrastive divergence learning. 
%
Theoretical analysis reveals that X-MEN provably converges to the \textit{global optimum} of the likelihood function in a linear number of SGD iterations. 
%
%Satisfying constraints. 
%
In addition, X-MEN can handle rewards parameterized either in a linear form or in the representation of a neural network.
%
During testing, the policy learned by X-MEN can also be adapted to satisfy additional constraints without retraining. %on the task.

%satisfying a list of predefined local and global hard constraints, during which we This knowledge allows us to generate always valid trajectories during the contrastive learning process to  Our method improves on prior work by being able to both simultaneously consider local constraints on states, actions and features in a Markov Decision Process (MDP) and global constraints on the whole trajectory. We show even assuming the knowledge of the transition probability, forward-backward dynamic programming algorithm can not handle this constrained IRL problem well. 

In experiments, we compare the performance of X-MEN against MaxEnt IRL \citep{ziebart2008maximum} and additional baselines such as Relative Entropy IRL (RE-IRL) \citep{boularias2011relative} and the recently proposed maximum likelihood constraint inference (MLCI) \citep{scobee2019maximum} on several grid world environments and in an imitation learning environment with human data to navigate around obstacles. All these environments require the agent to follow constraints. 
%
Our experiments show that, after learning, 100\% of the trajectories generated by X-MEN satisfy the constraints, while a majority of the trajectories produced by competing approaches do not ($\geq 60\%$ violate constraints). 
%
X-MEN also produces trajectories that closely imitate the demonstrations. 
%
In summary, our contributions are as follows:
\begin{itemize}
    \item We propose X-MEN, an algorithm that provably converges to the optimal reward function for MaxEnt IRL in a linear number of training steps, even in the presence of multi-state combinatorial constraints.
    \item X-MEN is guaranteed to produce trajectories which satisfy combinatorial constraints, beyond the capability of previous approaches. 
    \item Experimental results reveal that X-MEN produces trajectories that closely resemble the demonstrations while satisfying constraints, outperforming a series of constrained IRL baselines. 
\end{itemize}
%\cite{}




%\input{tex/prelim}
% \section{Preliminary}

\section{INVERSE REINFORCEMENT LEARNING}
Here we present a brief overview of IRL. Let $\mathcal{M}=\{\mathcal{S}, \mathcal{A}, T, R, \gamma\}$ be a Markov Decision Process (MDP), where $\mathcal{S}$ denotes the space of all states $s$, $\mathcal{A}$ denotes the set of possible actions $a$, $T$ denotes the transition probability function, $R$ denotes the reward function, and $\gamma \in [0, 1]$ is the discount factor. Given an MDP, an optimal policy $\pi^*$ is one that maximizes the expected cumulative reward. 
%
IRL considers the case where  the reward function is unknown. Instead, a set of expert demonstrations $\mathcal{D}=\{\tau_1,\ldots, \tau_N\}$ is provided. %, and the goal is to learn the hidden reward function that generates these demonstrations. 
%which are sampled from the optimal policy $\pi^*$. %i.e. provided by a demonstrator. 
Each demonstration
consists of a series of state-action pairs $\tau_i=\{ (s_{i1},a_{i1}),\ldots,(s_{iL_{i}},a_{iL_{i}})\}$, where $L_i$ denotes the length of the trajectory. The goal of IRL is to uncover the hidden reward $R$ from the demonstrations.

\subsection{Maximum Entropy IRL}
A number of approaches have been proposed to tackle the IRL problem \citep{ng2000algorithms,abbeel2004apprenticeship,ratliff2006maximum}. One crucial problem to address for IRL is to differentiate among multiple reward functions that lead to the same demonstrations. 
%
An influential formulation is Maximum Entropy IRL \citep{ziebart2008maximum}, which can also be viewed as a special case of Relative Entropy IRL (RE-IRL) \citep{boularias2011relative,snoswell2020revisiting}. %, where the baseline policy $\pi_0$ in RE-IRL is \xyx{set to ???}.. 
In this formulation, the probability that the demonstrator chooses a given trajectory is proportional to the exponential of the reward along the path. Let $R_{\theta_1}(\tau)=\sum_{t=1}^{L}\gamma^t R_{\theta_1}(s_t,a_t)$ denote the discounted cumulative reward parameterized by $\theta_1$. The probability of choosing trajectory $\tau$ satisfies:
\begin{align}
    P_{choice}(\tau|\theta_1) \propto  e^{R_{\theta_1}(\tau)}.\label{eq:zchoice}
\end{align}% \frac{1}{Z_{\theta}}
Let $d_{0}$ be the probability distribution of the initial state. %
Then $D(\tau)=d_0(s_1)\prod_{t=1}^{L}T(s_{t+1}|s_t,a_t)$ is the probability of the state-action transitions that lead to the trajectory $\tau$.
%
Following the standard setup for (inverse) reinforcement learning, we assume $D(\tau)$ is unknown and needs to be learned from the interactions with the IRL system. 
%
For this paper, we parameterize $D(\tau)$ in the form of $e^{d_{\theta_2}(\tau)}$, where $\theta_2$ is the parameter to be learned.
%
Hence, the overall probability of observing trajectory $\tau$ from demonstrations is proportional to the product of the choice probability and the state transition probability:
\begin{align*}
    P(\tau|\theta,T) \propto e^{R_{\theta_1}(\tau)}D(\tau)=e^{R_{\theta_1}(\tau)+d_{\theta_2}(\tau)}=e^{R_{\theta}(\tau)},
\end{align*}
where $\theta=[\theta_1, \theta_2]$ denotes the overall parameters to learn. With a slight abuse of notation, we use $R_{\theta}(\tau)$ to represent $R_{\theta_1}(\tau)+d_{\theta_2}(\tau)$. 
%where , and 
%where $\tau$ is a trajectory example, ,The partition function $Z_{\theta}$ is  and is usually the most difficult part to compute.
%The Max-Ent IRL can  

\subsection{IRL with Multi-state Combinatorial Constraints}
Despite the success of IRL models, many real-world tasks require additional constraints to be satisfied when learning from demonstrations. 
%For instance, an agent has to know that certain paths can never lead to an exit in a labyrinth. 
%
In this work, we restrict ourselves to hard combinatorial constraints, as shown in Figure \ref{fig:intuition}. Note that this is not particularly restrictive since, for example, safety constraints and constraints imposed by physical laws are often hard. Unlike previous work, which defines constraints only as a set of forbidden state-action pairs (which we call single-state constraints), we consider the more general case of combinatorial constraints that span multiple states. Let $C(\tau)=\{c_i(\tau)\}$ denote the set of constraints that each trajectory must satisfy, and let $I_C(\tau)$ be the indicator function of whether the constraints in $C(\tau)$ are satisfied. Formally,
\begin{align*}
    I_C(\tau)=\begin{cases}
    1, & \text{if } \tau \text{ satisfies the constraint set } C(\tau),\\
    0, & \text{otherwise.}
    \end{cases}
\end{align*}
We augment the MDP into the constrained MDP: $\mathcal{M}^C=\{\mathcal{S}, \mathcal{A}, T, R, C\}$. In this case, the probability of observing a trajectory $\tau$ now becomes:%\footnote{Notice $Z_\theta$ in Equation \ref{eq:constarined_p}  is different from $Z_\theta$ in Equation \ref{eq:zchoice} because of the introduction of $I_C$. $Z_\theta$ still normalizes the probability in Equation %\ref{eq:constarined_p}. Without too much cluttering, we use the same symbol $Z_\theta$ in both equations.}
\begin{align}\label{eq:constarined_p}
    P(\tau|\theta,T)=\frac{1}{Z_{\theta}}e^{R_{\theta}(\tau)}I_C(\tau),
\end{align}
%where $D(\tau)=d_0(s_1)\prod_{t=1}^{L}T(s_{t+1}|s_t,a_t)$. 
Here $Z_\theta$ is a normalization constant to ensure $P(\tau|\theta, T)$ is a probability distribution.
Given the set of expert demonstrations $\mathcal{D}$, we want to find the best reward function by maximizing the log likelihood function $L(\theta)$:
\begin{align*}
    \text{argmax}_{\theta}L(\theta)= \text{argmax}_{\theta}\frac{1}{|\mathcal{D}|}\sum_{\tau\in \mathcal{D}}R_{\theta}(\tau) - \log Z_{\theta}.
\end{align*}
Notice that only the terms that depend on the optimization variable $\theta$ are kept in the rightmost expression.
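To make the objective explicit, the following sketch expands the log likelihood under Equation~\ref{eq:constarined_p}. It assumes that $L(\theta)$ denotes the average log likelihood of the demonstrations and that the trajectory space is finite, so the normalization constant can be written as a sum. It also uses the fact that every demonstration satisfies the constraints, so $\log I_C(\tau)=0$ for $\tau\in\mathcal{D}$:
\begin{align*}
    Z_{\theta} &= \sum_{\tau'} e^{R_{\theta}(\tau')}I_C(\tau'),\\
    L(\theta) &= \frac{1}{|\mathcal{D}|}\sum_{\tau\in \mathcal{D}}\log P(\tau|\theta,T)\\
    &= \frac{1}{|\mathcal{D}|}\sum_{\tau\in \mathcal{D}}\left(R_{\theta}(\tau)+\log I_C(\tau)\right) - \log Z_{\theta}\\
    &= \frac{1}{|\mathcal{D}|}\sum_{\tau\in \mathcal{D}}R_{\theta}(\tau) - \log Z_{\theta}.
\end{align*}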

%\input{tex/method}
\section{XOR MAXIMUM ENTROPY IRL}

In this section we propose \textbf{X}OR-\textbf{M}aximum \textbf{EN}tropy Constrained Inverse Reinforcement Learning (X-MEN), to solve the inverse reinforcement learning problem with multi-state combinatorial constraints.
We develop X-MEN based on maximum entropy inverse reinforcement learning \citep{ziebart2008maximum,boularias2011relative,finn2016guided}.  Specifically, the model assumes that the expert samples the demonstrated trajectories $\{\tau_i\}$ from the distribution $P(\tau|\theta,T)$ in Equation \ref{eq:constarined_p},
where $R_{\theta}(s_t, a_t)=\theta^Tf(s_t,a_t)$ is a linear function of the feature vector $f(s_t,a_t)$. The features $f(s_t,a_t)$ can be hand-crafted or generated by a deep neural network. 
%
Forward-backward dynamic programming cannot readily solve this problem even if the state-transition function is given, due to the presence of the hard combinatorial constraints $I_C(\tau)$.
%
X-MEN addresses this problem by leveraging XOR-sampling to approximate $P(\tau|\theta,T)$.
%
After learning, X-MEN will only take actions that lead to trajectories satisfying constraints. 

We use Stochastic Gradient Descent (SGD) to optimize the objective, where in each iteration we compute the gradient of the log likelihood as follows:
\begin{align}\label{eq:grad_ll}
    &\nabla_{\theta}L(\theta)=\frac{1}{|\mathcal{D}|}\sum_{\tau\in \mathcal{D}}\nabla_{\theta} R_{\theta}(\tau) - \nabla_{\theta}\log Z_{\theta}\notag\\ 
    =&\frac{1}{|\mathcal{D}|}\sum_{\tau\in \mathcal{D}}\nabla_{\theta} R_{\theta}(\tau) -\sum_{\tau}P(\tau|\theta,T) \nabla_{\theta}R_{\theta}(\tau). 
\end{align}
The first term in Equation~\ref{eq:grad_ll} represents the expectation of $\nabla_{\theta} R_{\theta}(\tau)$ over all the trajectories in the training dataset, i.e., $\mathbb{E}_{\mathcal{D}}[\nabla_{\theta} R_{\theta}(\tau)]$. The second term is the expectation of $\nabla_{\theta} R_{\theta}(\tau)$ over trajectories drawn from $P(\tau|\theta,T)$, i.e.,  $\mathbb{E}_P[\nabla_{\theta} R_{\theta}(\tau)]$. 
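As a sketch of why the second equality in Equation~\ref{eq:grad_ll} holds, assume a finite trajectory space and write $Z_{\theta}=\sum_{\tau}e^{R_{\theta}(\tau)}I_C(\tau)$, which is the normalization implied by Equation~\ref{eq:constarined_p}. Differentiating the log partition function gives
\begin{align*}
    \nabla_{\theta}\log Z_{\theta} &= \frac{1}{Z_{\theta}}\sum_{\tau}e^{R_{\theta}(\tau)}I_C(\tau)\nabla_{\theta}R_{\theta}(\tau)\\
    &= \sum_{\tau}P(\tau|\theta,T)\nabla_{\theta}R_{\theta}(\tau).
\end{align*}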
%
To approximate $\nabla_{\theta}L(\theta)$ in Equation~\ref{eq:grad_ll}, we sample $M_1$ trajectories from the dataset of demonstrations to form the set $\mathcal{D}_{M_1}$. 
%
Then we sample $M_2$ trajectories from $P(\tau | \theta, T)$, to form $\mathcal{D}_{M_2}^P$.
%
We use $g_{\theta}$ in the following Theorem \ref{Th:compute_g} to approximate  $\nabla_{\theta}L(\theta)$:
\begin{Th}\label{Th:compute_g}
Let the model distribution $P(\tau|\theta,T)$ be defined as in Equation \ref{eq:constarined_p} with $R_{\theta}(s_t, a_t)=\theta^Tf(s_t,a_t)$, and let the gradient of the log likelihood be as given in Equation \ref{eq:grad_ll}. Let $g_{\theta}$ be  
\begin{align}\label{eq:g_theta}
    g_{\theta}=\frac{1}{M_1}\sum_{\tau\in \mathcal{D}_{M_1}} f(\tau) - \frac{1}{M_2}\sum_{\tau\in\mathcal{D}^P_{M_2}}f(\tau),
\end{align}
where $\mathcal{D}_{M_1}$ and $\mathcal{D}_{M_2}^P$ are defined above. Then $g_{\theta}$ is an unbiased estimator of $\nabla_{\theta}L(\theta)$, i.e., $\mathbb{E}[g_{\theta}] = \nabla_{\theta}L(\theta)$. 
\end{Th}
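A brief sketch of why this holds, under two assumptions that are ours rather than stated above: $f(\tau)$ denotes the discounted feature sum along the trajectory, and the $M_2$ model samples are drawn exactly from $P(\tau|\theta,T)$. With a linear reward,
\begin{align*}
    f(\tau) &= \sum_{t=1}^{L}\gamma^t f(s_t,a_t),\\
    R_{\theta}(\tau) &= \theta^T f(\tau), \qquad \nabla_{\theta}R_{\theta}(\tau) = f(\tau),
\end{align*}
so by linearity of expectation,
\begin{align*}
    \mathbb{E}[g_{\theta}] = \mathbb{E}_{\mathcal{D}}[f(\tau)] - \mathbb{E}_{P}[f(\tau)] = \nabla_{\theta}L(\theta).
\end{align*}
When the model samples instead follow $P(\tau|\theta,T)$ only up to a constant multiplicative factor, we would expect this equality to hold only approximately.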

XOR-Sampling is used to obtain samples from $P(\tau | \theta, T)$ such that the probability of drawing a sample is within a constant multiplicative factor of the true probability. %a constant approximation.
%
XOR-Sampling is the result of a rich line of research \citep{ermon2013embed,Gomes06XORCounting,Gomes2007XORCounting}, which translates the \#P-complete sampling problem into queries to NP oracles with provable guarantees. 
%
The high level idea of XOR sampling is as follows. 
%
Suppose one would like to draw one ball uniformly at random from an urn, with access to an oracle that returns one ball from the urn once queried (implemented as an NP-oracle when sampling in a combinatorial space). Notice that the oracle will not necessarily return the balls uniformly at random; e.g., it may return the same ball every time. 
%
XOR-sampling removes balls from the urn by introducing additional XOR constraints. One can prove that, in expectation, half of the balls are removed at random each time an XOR constraint is introduced. 
%
Hence, one keeps adding XOR constraints until only one ball remains. Then this last ball is returned. 
%
Since the balls are removed at random, the last remaining ball is a random one drawn from the original set of balls.
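As a hypothetical toy illustration (the constraints below are chosen by hand, whereas XOR-sampling draws them at random), suppose the urn contains the $8$ binary strings $x=(x_1,x_2,x_3)\in\{0,1\}^3$. Each XOR constraint is a parity equation over a subset of the bits, and adding three of them leaves a single string:
\begin{align*}
    x_1\oplus x_2 = 1 &:\ \{100, 101, 010, 011\},\\
    x_2\oplus x_3 = 0 &:\ \{100, 011\},\\
    x_1 = 1 &:\ \{100\}.
\end{align*}
Here each constraint removes exactly half of the remaining strings; with randomly drawn constraints this halving holds only in expectation.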
%
In practice, XOR-sampling also works with weighted probability distributions. 
%
While giving strong probabilistic guarantees, XOR-sampling requires solving NP-complete problems during the sampling process. Hence it introduces additional computational overhead compared to conventional approaches such as MCMC sampling. 
%
Nevertheless, recent advancements in constraint solvers allow us to solve industrial-sized combinatorial problems within a reasonable amount of time. 
%
While we acknowledge the trade-off between the computational overhead and the sample quality, we find that the benefit of using XOR-sampling outweighs its cost in solving IRL problems involving hard combinatorial constraints. 

%
Our paper uses the probabilistic bound of XOR-sampling via Theorem~\ref{Th:bound2}. 
%
We refer the reader to \citet{ermon2013embed,fan2021xorcd,fan2021xorsgd} for details on the discretization scheme and the choice of the XOR-sampling parameters used to obtain the bound in Theorem \ref{Th:bound2}.


\begin{Th}\label{Th:bound2}\citep{ermon2013embed}~
Let $\delta>1$, $0 < \gamma < 1$, and let $w: \{0,1\}^n \rightarrow \mathbb{R}^+$ be an unnormalized 
probability density function. % where $n=|\mathcal{S}||\mathcal{A}|$.
Let $Q(\tau|{\theta}) \propto w(\tau)$ be the normalized distribution and $C(\tau)$ be the set of hard combinatorial constraints. Then, with probability at least $1-\gamma$, XOR-Sampling$(w, C(\tau), \delta, \gamma)$ succeeds and outputs a sample $\tau_0$ using $O\left(-n\log(1- 1/\sqrt{\delta}) \log\left(-n/\gamma\log(1-1/\sqrt{\delta})\right)\right)$ queries to NP oracles. Upon success, each $\tau_0$ is produced with probability $Q'(\tau_0)$. 
% We must have 
% $$1/\delta Q(\tau_0|\hat{\theta}) \leq Q'(\tau_0) \leq \delta Q(\tau_0|\hat{\theta}).$$ 
% Moreover, 
Let $\phi:\{0,1\}^n\rightarrow\mathbb{R}^+$ be a non-negative function. Then the expectation of 
$\phi(\tau)$ for one sample satisfies
\begin{align}
    \frac{1}{\delta}\mathbb{E}_{Q}[\phi(\tau)]\leq\mathbb{E}_{Q'}[\phi(\tau)] \leq\delta\mathbb{E}_{Q}[\phi(\tau)].\label{eq:bound_eq2}
\end{align}
\end{Th}
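For a concrete reading of Equation~\ref{eq:bound_eq2}, take the hypothetical value $\delta=1.2$, chosen only for illustration. Then
\begin{align*}
    \mathbb{E}_{Q'}[\phi(\tau)] &\geq \frac{1}{1.2}\,\mathbb{E}_{Q}[\phi(\tau)] \approx 0.83\,\mathbb{E}_{Q}[\phi(\tau)],\\
    \mathbb{E}_{Q'}[\phi(\tau)] &\leq 1.2\,\mathbb{E}_{Q}[\phi(\tau)],
\end{align*}
so the expectation under the XOR-Sampling distribution $Q'$ lies between roughly $17\%$ below and $20\%$ above the expectation under $Q$.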



The detailed procedure of X-MEN is shown in Algorithm \ref{alg:X-MEN}. Here we present the version of X-MEN in which the only parameter to optimize is $\theta$. 
%
A variant of this algorithm can be developed that also back-propagates the gradient through the feature vector $f(s,a)$, when $f(s,a)$ is represented as a neural network and is also updated during learning.
%
Notice when $f(s,a)$ is represented as a neural network, the log likelihood function is no longer concave. Hence the formal guarantees stated in Theorem 
\ref{Th:main} do not apply. However, this does not prevent X-MEN from being a useful algorithm in practice. 

%
X-MEN takes as inputs the feature vector $f(s,a)$, transition probability $D(\tau)$, constraint set $C(\tau)$, training data $\{\tau_i\}_{i=1}^N$, initial model parameter $\theta_0$, 
the learning rate $\eta$, the number of SGD iterations $K$,
XOR-Sampling parameters $(\delta, \gamma)$, and batch sizes $M_1$, $M_2$, and outputs the averaged learned parameter $\overline{\theta_{K}}$. 
%
To approximate $\mathbb{E}_{P}[\nabla_{\theta}R_{\theta}(\tau)]$ at the $k$-th iteration, X-MEN draws $M_2$ samples $\tau'_1, \dots, \tau'_{M_2}$ from  $P(\tau|{\theta}, T)$ using XOR-Sampling, where $M_2$ is a user-determined sample size.
%
Because XOR-Sampling has a failure rate, X-MEN repeatedly calls XOR-Sampling until all $M_2$ samples are obtained successfully (lines 3--8). Then, X-MEN also draws $M_1$ samples from the training set $\{\tau_i\}_{i=1}^N$ uniformly at random to approximate $\mathbb{E}_{\mathcal{D}}[\nabla_{\theta}R_{\theta}(\tau)]$. 
%
Once all the samples are obtained, X-MEN uses 
$g_k=\frac{1}{M_1}\sum_{\tau\in \mathcal{D}_{M_1}} f(\tau)-\frac{1}{M_2}\sum_{j=1}^{M_2}f(\tau'_j)$ as an approximation of the gradient of the log likelihood. 
%\begin{align}
%    \overline{g_t}=
%\end{align}
The parameter $\theta$ is updated following the rule $\theta_{k+1} = \theta_{k}+\eta g_k$ for $K$ steps, where $\eta$ is the learning rate. Finally, the average of $\theta_1, \ldots, \theta_K$, namely $\overline{\theta_{K}}=\frac{1}{K}\sum_{k=1}^{K}\theta_k$, is the output of the algorithm.
%
In the next sections, we show that X-MEN converges to the global optimum of the log likelihood objective in a linear number of iterations, and illustrate how to incorporate XOR-Sampling into our framework for sample generation with strict constraint satisfaction.
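
To illustrate a single update, consider a hypothetical two-dimensional example with made-up numbers: $\theta_0=(0,0)$, $\eta=0.1$, a demonstration feature average of $(1.0, 0.5)$, and a sampled feature average of $(0.4, 0.7)$. Then
\begin{align*}
    g_0 &= (1.0, 0.5) - (0.4, 0.7) = (0.6, -0.2),\\
    \theta_1 &= \theta_0 + \eta g_0 = (0.06, -0.02).
\end{align*}
The weight of the first feature increases because the demonstrations exhibit more of it than the model's samples do, and the weight of the second feature decreases for the opposite reason.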

\begin{algorithm}[t!]
   \caption{XOR Maximum Entropy Constrained Inverse Reinforcement Learning (X-MEN)}
   \label{alg:X-MEN}
   \LinesNumbered
   \KwIn{$\theta_0, f(s, a), K, \eta, \delta, \gamma, D(\tau), C(\tau), M_1,M_2,\mathcal{D}$.}
   \For{$k=0$ {\bfseries to} $K-1$}{
        $j\gets 1$     \tcp*[f]{\text{$M_1$ and $M_2$ are batch sizes}}\\
        \While{$j\leq M_2$}{
            $\tau'\gets$ XOR-Sampling$\left(e^{{\theta_k}^T f(\tau)}, C(\tau), \delta, \gamma\right)$ 
            % \tcp*[f]{\text{$\delta,\gamma$ are parameters}}\\
            \If{$\tau' \neq \text{Failure}$}  {
                $\tau'_j\gets \tau'$; $j\gets j+1$ 
                % \tcp*[f]{\text{$\tau'==Failure$ means failure of XOR-Sampling}}\\
            }
      }
      Get samples~~ $\mathcal{D}_{M_1}=\{\tau_j\}_{j=1}^{M_1}$ from $\mathcal{D}$.\\
    %   Compute the gradient~~~ 
      $g_k=\frac{1}{M_1}\sum_{\tau\in \mathcal{D}_{M_1}} f(\tau)-\frac{1}{M_2}\sum_{j=1}^{M_2}f(\tau'_j)$\\
      Update the parameters~~ $\theta_{k+1} = \theta_{k}+\eta g_k$
    }
    \textbf{return}$~~\overline{\theta_{K}}=\frac{1}{K}\sum_{k=1}^{K}\theta_k$
    % \tcp*[f]{\text{return the averaged learned parameter}}
\end{algorithm}

\subsection{Linear Convergence to the Global Optimum}
Suppose the only parameter to learn is $\theta$; in other words, the features $f(s,a)$ are fixed and 
the reward function $R_{\theta}(\tau)$ is represented by a linear combination of hand-crafted features. 
Then the objective is concave with respect to $\theta$. Under this circumstance, X-MEN converges to the global optimum of the log likelihood function up to a few vanishing terms. 
%
Moreover, the speed of the convergence is linear 
with respect to the number of stochastic gradient descent steps. 
%
Let $Var_{\mathcal{D}}(f(\tau)) = \mathbb{E}_{\mathcal{D}}[||f(\tau)||_2^2] - ||\mathbb{E}_{\mathcal{D}}[f(\tau)]||_2^2$ and 
$Var_{P}(f(\tau)) = \mathbb{E}_{{P}}[||f(\tau)||_2^2] - ||\mathbb{E}_{{P}}[f(\tau)]||_2^2$ denote the total variance of $f(\tau)$ w.r.t. the data distribution $P_{\mathcal{D}}$ and the model distribution $P(\tau|\theta,T)$, respectively.
The convergence theorem is stated precisely as follows:
% \begin{Th}\label{Th:main}
% (main)~ Let $P(\tau|\theta,T)$ and $Q(\tau|{\theta})$ as defined in Equation \ref{eq:constarined_p} and \ref{eq:Q},  $R_{\theta}(\tau)=\theta^Tf(\tau)$. %and $\hat{\theta}=\theta$.
% %
% Given trajectories $\mathcal{D}=\{\tau_i\}_{i=1}^N$ and the objective function $L(\theta)$,  denote $OPT=\min_{\theta} L(\theta)$ and $\theta^*=\text{argmin}_{\theta}L(\theta)$. 
% Let $Var_{\mathcal{D}}(f(\tau))\leq\sigma_1^2$ and $\max_{\theta}Var_{P}(f(\tau))\leq \sigma_2^2$. 
% %
% Suppose $1\leq\delta\leq\sqrt{2}$ is used in XOR-sampling, the learning rate $\eta\leq \frac{2-\delta^2}{\sigma_2^2\delta}$, and $\overline{\theta_K}$ is the output of X-MEN. We have: 
% \begin{align*}
%      \mathbb{E}[L(\overline{\theta_K})]-OPT \leq\frac{\delta^2||\theta_0-\theta^*||_2^2}{2\eta K}+\frac{\eta\sigma_1^2}{\delta^2 M_1}+\frac{\eta\sigma_2^2}{\delta^2M_2}.
% \end{align*}
% \end{Th}
\begin{Th}\label{Th:main}
(main)~ Let $P(\tau|\theta,T)$ be as defined in Equation~\ref{eq:constarined_p} % and \ref{eq:Q}, 
and let $R_{\theta}(\tau)=\theta^Tf(\tau)$. %and $\hat{\theta}=\theta$.
%
Given trajectories $\mathcal{D}=\{\tau_i\}_{i=1}^N$ and the objective function $L(\theta)$,  denote $OPT=\max_{\theta} L(\theta)$ and $\theta^*=\text{argmax}_{\theta}L(\theta)$. 
Let $Var_{\mathcal{D}}(f(\tau))\leq\sigma_1^2$, $||\mathbb{E}_{\mathcal{D}}[f(\tau)]||_2^2\leq E^2$, $||\theta_k - \theta^*||_2 \leq R$,  $\max_{\theta}Var_{P}(f(\tau))\leq \sigma_2^2$, $||\mathbb{E}_{P}[f(\tau)^+]||_2^2\leq G^2$, and $||\mathbb{E}_{P}[f(\tau)^-]||_2^2\leq G^2$.
%
Suppose $1\leq\delta\leq\sqrt{2}$ is used in XOR-sampling, the learning rate $\eta\leq \frac{2-\delta^2}{\sigma_2^2\delta}$, and $\overline{\theta_K}$ is the output of X-MEN. We have: % \xyx{need update the bound}: 
\begin{align*}
     OPT-\mathbb{E}[L(\overline{\theta_K})]& 
     \leq\frac{\delta||\theta_0-\theta^*||_2^2}{2\eta K}+\frac{\eta\sigma_1^2}{\delta M_1}+\frac{\eta\sigma_2^2}{\delta M_2}+\\
      &   2(\delta^2-1)(G+E)R + 2\eta (\delta^3-\delta)(G+E)^2.
\end{align*}
\end{Th}

X-MEN is the first algorithm for constrained inverse reinforcement learning problems that provably converges to the global optimum of the likelihood function up to several tail terms. Moreover, the rate of convergence is linear in the number of SGD iterations $K$. Previous approaches for IRL problems with hard combinatorial constraints do not have such tight bounds. 
%
In the bound stated above, the first term is inversely proportional to the number of SGD iterations $K$. The second and third terms can be reduced by increasing $M_1$ and $M_2$, i.e., by drawing more samples. The last two terms can be reduced by decreasing $\delta$ towards $1$, i.e., by using a more precise version of XOR-sampling. 
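
As a sanity check on the bound, consider the idealized case $\delta=1$, i.e., exact sampling; this specialization is an illustration rather than a separate result. The last two terms vanish because $\delta^2-1=0$ and $\delta^3-\delta=0$, the step-size condition becomes $\eta\leq 1/\sigma_2^2$, and the right-hand side of Theorem~\ref{Th:main} reduces to
\begin{align*}
\frac{||\theta_0-\theta^*||_2^2}{2\eta K}+\frac{\eta\sigma_1^2}{M_1}+\frac{\eta\sigma_2^2}{M_2},
\end{align*}
which has the familiar form of an SGD guarantee: an optimization term decaying as $1/K$ plus variance terms that shrink with the batch sizes.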

The main challenge in proving Theorem~\ref{Th:main}
lies in the fact that we cannot ensure the unbiasedness of the 
gradient estimator. 
%
Because the objective is concave with respect to $\theta$ and smooth, a gradient descent algorithm can be proven 
to converge linearly towards the optimal value if the 
estimated gradient is unbiased, i.e., $\mathbb{E}[g_k] = \nabla_{\theta} L(\theta_k)$. 
%
However, even though we apply XOR-sampling, which has
a constant approximation bound in generating 
samples from the model distribution, 
we still cannot guarantee the unbiasedness of $g_k$. 
%
Instead, using the constant factor approximation of XOR-Sampling, which is formally stated in Theorem~\ref{Th:bound2}, the bound
for $g_{k}$ takes the following form:
\begin{align}
    \frac{1}{\delta} [\nabla L(\theta_k)]^+ \leq \mathbb{E}[g_k^+]
    \leq \delta [\nabla L(\theta_k)]^+,\label{eq:gb1}\\
    \delta [\nabla L(\theta_k)]^- \leq \mathbb{E}[g_k^-]
    \leq \frac{1}{\delta} [\nabla L(\theta_k)]^-.\label{eq:gb2}
\end{align}
Here, $\delta\geq 1$ is a constant factor, $[f]^+$ denotes the positive part of $f$, i.e.,
$[f]^+ = \max\{f, \mathbf{0}\}$, and $[f]^-$ denotes the negative part of $f$, i.e.,
$[f]^- = \min\{f, \mathbf{0}\}$. 
%
%The bound in Equation~\ref{eq:gb1} and \ref{eq:gb2}
%can be proven by bounding the nominator and the denominator of Equation \ref{eq:g_theta}, and we leave the proof in the supplementary materials.  % and applying Equation~\ref{eq:bound_eq2}.
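
As a small illustration of this notation (the numbers are hypothetical), suppose $\nabla L(\theta_k)=(2,-3)^T$. Then
\begin{align*}
[\nabla L(\theta_k)]^+=(2,0)^T,\qquad [\nabla L(\theta_k)]^-=(0,-3)^T,
\end{align*}
and the two parts sum to $\nabla L(\theta_k)$. Assuming the inequalities in Equations~\ref{eq:gb1} and \ref{eq:gb2} are read elementwise, they require $2/\delta\leq\mathbb{E}[g_k^+]_1\leq 2\delta$ and $-3\delta\leq\mathbb{E}[g_k^-]_2\leq -3/\delta$, while the remaining components $\mathbb{E}[g_k^+]_2$ and $\mathbb{E}[g_k^-]_1$ are forced to be zero.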

The proof of Theorem~\ref{Th:main} relies mainly on Theorem~\ref{Th:XOR_SGD_bound} below, which bounds the error of Stochastic Gradient
Descent (SGD) algorithms that only have
access to constant-factor approximations of the gradient. 
%
Theorem~\ref{Th:XOR_SGD_bound} was proved in \cite{fan2021xorcd} to help bound the error of learning an exponential family model.
%
Theorem~\ref{Th:XOR_SGD_bound} requires the function 
$f$ to be $L$-smooth. $f(\theta)$ is $L$-smooth if and only if $||\nabla f(\theta_1) - \nabla f(\theta_2)||_2 \leq L ||\theta_1 - \theta_2||_2$ for all $\theta_1, \theta_2$. 
%
Notice that the conditions of Theorem~\ref{Th:main}
automatically guarantee the $L$-smoothness of the objective; we leave the proof to the appendix.
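
To make the smoothness claim concrete, the following sketch assumes the exponential-family form $P(\tau|\theta,T)\propto w(\tau)e^{\theta^Tf(\tau)}$ with a weight $w(\tau)\geq 0$ independent of $\theta$ ($w$ and $\mathrm{Cov}_P$ are notation introduced here for illustration) and takes smoothness to mean that the gradient is Lipschitz. Under this form, $\nabla_{\theta}L(\theta)=\mathbb{E}_{\mathcal{D}}[f(\tau)]-\mathbb{E}_{P}[f(\tau)]$, and differentiating once more gives
\begin{align*}
\nabla^2_{\theta}L(\theta) &= -\mathrm{Cov}_{P}(f(\tau)),\\
\mathrm{Cov}_{P}(f(\tau)) &= \mathbb{E}_{P}[f(\tau)f(\tau)^T]-\mathbb{E}_{P}[f(\tau)]\mathbb{E}_{P}[f(\tau)]^T,\\
||\nabla^2_{\theta}L(\theta)||_2 &\leq \mathrm{tr}\left(\mathrm{Cov}_{P}(f(\tau))\right)=Var_{P}(f(\tau))\leq\sigma_2^2 .
\end{align*}
The Hessian is the negative of a covariance matrix, so $L(\theta)$ is concave, and its spectral norm is at most the trace of that covariance, so the gradient is Lipschitz with constant at most $\sigma_2^2$. This is consistent with the step-size condition $\eta\leq\frac{2-\delta^2}{\sigma_2^2\delta}$ in Theorem~\ref{Th:main}, which matches $\eta\leq\frac{2-c^2}{Lc}$ in Theorem~\ref{Th:XOR_SGD_bound} with $c=\delta$ and $L=\sigma_2^2$.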

% \begin{Th}\label{Th:XOR_SGD_bound}\citep{fan2021xorcd}~
% Let $f:\mathbb{R}^d\rightarrow \mathbb{R}$ be a $L$-smooth convex function and $\theta^*=\text{argmin}_{\theta} f(\theta)$. In iteration $k$ of SGD, $g_k$ is the estimated gradient, i.e., $\theta_{k+1}=\theta_{k}-\eta g_k$. If $Var(g_k)\leq \sigma^2$, and there exists $1\leq c\leq\sqrt{2}$ s.t. $\frac{1}{c}[\nabla f(\theta_k)]^+ \leq \mathbb{E}[g_k^+]\leq c[\nabla f(\theta_k)]^+$ and $c[\nabla f(\theta_k)]^- \leq \mathbb{E}[g_k^-]\leq \frac{1}{c}[\nabla f(\theta_k)]^-$, then for any $K>1$ and step size $\eta\leq \frac{2-c^2}{Lc}$, let $\overline{\theta_K}=\frac{1}{K}\sum_{k=1}^K \theta_k$, we have 
% \begin{align}\label{eq:XOR_SGD_bound}
%     \mathbb{E}[f(\overline{\theta_K})]-f(\theta^*)\leq \frac{c||\theta_0-\theta^*||_2^2}{2\eta K}+\frac{\eta\sigma^2}{c}.
% \end{align}
% \end{Th}
\begin{Th}\label{Th:XOR_SGD_bound}\citep{fan2021xorcd}~
Let $f:\mathbb{R}^d\rightarrow \mathbb{R}$ be an $L$-smooth convex function and $\theta^*=\text{argmin}_{\theta} f(\theta)$. In iteration $k$ of SGD, $g_k$ is the estimated gradient, i.e., $\theta_{k+1}=\theta_{k}-\eta g_k$. If $Var(g_k)\leq \sigma^2$, $||\mathbb{E}[g_k^+]||_2 \leq G$, $||\mathbb{E}[g_k^-]||_2 \leq G$, $||\theta_k - \theta^*||_2 \leq R$, and there exists $1\leq c\leq\sqrt{2}$ s.t. $\frac{1}{c}[\nabla f(\theta_k)]^+ \leq \mathbb{E}[g_k^+]\leq c[\nabla f(\theta_k)]^+$ and $c[\nabla f(\theta_k)]^- \leq \mathbb{E}[g_k^-]\leq \frac{1}{c}[\nabla f(\theta_k)]^-$, then for any $K>1$ and step size $\eta\leq \frac{2-c^2}{Lc}$, the average $\overline{\theta_K}=\frac{1}{K}\sum_{k=1}^K \theta_k$ satisfies 
\begin{align}\label{eq:XOR_SGD_bound}
    \mathbb{E}[f(\overline{\theta_K})]-f(\theta^*) & \leq \frac{c||\theta_0-\theta^*||_2^2}{2\eta K}+\frac{\eta\sigma^2}{c} + \nonumber\\
    & 2(c-\frac{1}{c})GR + 2\eta (c-\frac{1}{c})G^2.
\end{align}
\end{Th}
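
To see where the sampling terms of Theorem~\ref{Th:main} come from, the following back-of-the-envelope calculation instantiates the variance term of Theorem~\ref{Th:XOR_SGD_bound}. It is a sketch under idealized assumptions introduced here: the $M_1$ demonstrations and the $M_2$ model samples are drawn independently, the demonstrations have total variance at most $\sigma_1^2$, and the sampled trajectories have total variance at most $\sigma_2^2$. A full proof must also account for the approximation error of XOR-Sampling. Since $g_k$ is then the difference of two independent sample means,
\begin{align*}
Var(g_k) &= Var\Big(\frac{1}{M_1}\sum_{\tau\in\mathcal{D}_{M_1}}f(\tau)\Big)+Var\Big(\frac{1}{M_2}\sum_{j=1}^{M_2}f(\tau'_j)\Big)\\
&\leq \frac{\sigma_1^2}{M_1}+\frac{\sigma_2^2}{M_2}.
\end{align*}
Substituting this bound for $\sigma^2$ and $c=\delta$ into the term $\frac{\eta\sigma^2}{c}$ of Equation~\ref{eq:XOR_SGD_bound} gives $\frac{\eta\sigma_1^2}{\delta M_1}+\frac{\eta\sigma_2^2}{\delta M_2}$, the second and third terms of Theorem~\ref{Th:main}.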

The proof of Theorem~\ref{Th:main} applies 
Theorem~\ref{Th:XOR_SGD_bound} to the convex function $-L(\theta)$, noting that $L(\theta)$ is $L$-smooth when the total variation $Var_P(f(\tau))$ is bounded \citep{fan2021xorcd}.
Theorem~\ref{Th:main} states that, in expectation, the gap between the objective value at the output of X-MEN and the true optimum $OPT$ is bounded by a term that is inversely proportional to the number of iterations $K$ plus several tail terms.
%To reduce the tail term with fixed steps $\eta$, we can generate more samples at each iteration to reduce the variance (increase $M_1$ and $M_2$). 
%
In addition, to quantify the computational complexity of X-MEN, 
we prove the following theorem in the supplementary materials detailing the number of queries to NP oracles needed for X-MEN.
% \begin{Th}\label{Th:num_queries}
% Let $|\mathcal{S}|$ and $|\mathcal{A}$ be the number of all possible states and all possible actions, respectively, then X-MEN in Algorithm \ref{alg:X-MEN} uses $O\left(K|\mathcal{S}||\mathcal{A}|\ln\frac{|\mathcal{S}||\mathcal{A}|}{\gamma}+KM_2\right)$ queries to NP oracles.
% \end{Th}
\begin{Th}\label{Th:num_queries}
Let $|\mathcal{S}|$ and $|\mathcal{A}|$ be the number of all possible states and all possible actions, respectively. Then X-MEN in Algorithm~\ref{alg:X-MEN} uses $O (-K|\mathcal{S}||\mathcal{A}|\log(1- 1/\sqrt{\delta}) \log(-|\mathcal{S}||\mathcal{A}|/\gamma\log(1-1/\sqrt{\delta})) + KM_2 )$ queries to NP oracles.
% $O\left(K|\mathcal{S}||\mathcal{A}|\ln\frac{|\mathcal{S}||\mathcal{A}|}{\gamma}+KM_2\right)$
\end{Th}




% \input{tex/analysis}

%\input{tex/related}
\section{RELATED WORK}
Max-Ent IRL models were first proposed in \citep{ziebart2008maximum} to address the inherent ambiguity of possible reward functions and induced policies for an observed behavior. During training, a forward-backward dynamic programming algorithm is used to exactly compute the partition function and marginal probabilities \citep{snoswell2020revisiting}, assuming knowledge of the transition probabilities. Relative Entropy IRL \citep{boularias2011relative} extends this work by using importance sampling to obtain an unbiased estimate of the partition function without knowing the dynamics. Guided Cost Learning \citep{finn2016guided} further learns a Max-Ent model with policy optimization. Later work accommodates arbitrary nonlinear reward functions such as neural networks \citep{finn2016guided,kalweit2020deep,wulfmeier2015maximum}, instead of a linear combination of features. The more recent Generative Adversarial Imitation Learning (GAIL) \citep{ho2016generative} is an imitation learning method that does not require estimating likelihoods.
%
However, while Markovian rewards often provide a succinct and expressive way to specify the objectives of a task, they cannot capture all possible task specifications, especially additional constraints \citep{vazquez2017learning}. Recent work on constrained IRL focuses only on local constraints on states, actions and features \citep{chou2018learning,subramani2018inferring,mcpherson2018modeling}, which can hardly represent all real-world scenarios, as most constraints are trajectory-long. Other methods focus on learning constraints from the demonstrations, such as maximum likelihood constraint inference \citep{scobee2019maximum,kalweit2020deep,anwar2020inverse,mcpherson2021maximum}. Our approach differs from all existing methods and addresses the open question of learning with hard combinatorial constraints. We adapt the Max-Ent framework to allow us to reason about all the trajectories that satisfy the constraints during the contrastive learning process. Here we only consider pre-defined constraints. Note that even with full knowledge of the transition probabilities, dynamic programming cannot work well under trajectory-long constraints since it has no knowledge of any hard combinatorial information.
%
X-MEN was motivated by recently proposed probabilistic inference techniques based on hashing and randomization for sampling \citep{ermon2013embed,ivrii2015computing}, counting \citep{Gomes2006NearUniformSampling,ding2019towards}, and marginal inference problems \citep{Ermon13Wish,kuck2019adaptive,Chakraborty2014DistributionAwareSA,Chakraborty2015WeightedCounting,belle2015hashing} with constant approximation guarantees. Recent work also shows the success of XOR-Sampling \citep{ermon2013embed} in boosting stochastic optimization algorithms \citep{fan2021xorsgd} and improving machine learning tasks on structure generation \citep{fan2021xorcd}.

%\input{tex/exp}
\section{EXPERIMENTS}

\begin{figure*}[t]
\centering
\subfigure[reward map]{\label{fig:reward}
\includegraphics[width=0.3\linewidth]{figs/grid_reward.pdf}}
\subfigure[ground truth]{\label{fig:gt}
\includegraphics[width=0.3\linewidth]{figs/grid_gt.pdf}}
\subfigure[X-MEN]{\label{fig:x-men}
\includegraphics[width=0.35\linewidth]{figs/grid_recover.pdf}}
\subfigure[Maxent]{\label{fig:maxent}
\includegraphics[width=0.3\linewidth]{figs/grid_maxent.pdf}}
\subfigure[RE-IRL]{\label{fig:re}
\includegraphics[width=0.3\linewidth]{figs/grid_re.pdf}}
\subfigure[MLCI]{\label{fig:mlci}
\includegraphics[width=0.35\linewidth]{figs/grid_mlci.pdf}}
% \hspace{-0.2in}
% \vspace*{-0.1cm}
\caption{The superior performance of X-MEN against baselines in the grid world environment. \textbf{(a)} The ground truth reward map of the $9\times9$ gridworld. The reward of each state is 0, except for $S_G$, which is 1. Red symbols denote constraints: the red triangle denotes the state that must be passed through first among all the symbols, red crosses denote the states that can never be passed through, and the agent must pass through only one red square and one red circle. \textbf{(b)-(f)} The marginal probability of passing through each state for the ground truth demonstrations and for the distributions generated by different learning algorithms. The distribution of trajectories from X-MEN matches the demonstrations most closely. Neither Maxent IRL nor RE-IRL can handle constraints. While MLCI knows ``where not to go'', it has difficulty in knowing ``where must go'', and we show in Figure \ref{fig:structure} that it cannot generate $100\%$ constraint-satisfying trajectories.} 
% \vspace{-0.1in}
\label{fig:gridworld}
\end{figure*}

\begin{figure*}[t]
\subfigure[]{\label{fig:sample2sample}
\includegraphics[width=0.32\linewidth]{figs/valid_sample2sample.pdf}}
\subfigure[]{\label{fig:sample2time}
\includegraphics[width=0.32\linewidth]{figs/valid_sample2time.pdf}}
% \hspace{-0.2in}
\subfigure[]{\label{fig:distribution}
\includegraphics[width=0.32\linewidth]{figs/distribution.pdf}}
% \hspace{-0.2in}
% \vspace*{-0.1cm}
\caption{X-MEN outperforms competing approaches by producing 100\% valid trajectories while capturing the inductive bias in the demonstrations on a $9\times 9$ gridworld benchmark shown in Figure \ref{fig:reward}. (\textbf{Left}) The percentage of valid trajectories generated by different algorithms, varying the number of demonstration trajectories. (\textbf{Middle}) The percentage of valid trajectories generated by different algorithms, varying the training time. (\textbf{Right}) The dashed line shows the percentage of valid trajectories generated by different algorithms. The bars show the distributions of these valid trajectories grouped by type of path (upper paths or lower paths). X-MEN generates 100$\%$ valid trajectories, and its trajectory distribution has the smallest KL divergence ($0.005$) from that of the demonstrations.} 
% \vspace{-0.1in}
\label{fig:structure}
\end{figure*}

We conduct experiments similar to those in \cite{scobee2019maximum}. We first show the superior performance of X-MEN on a set of synthetic grid world benchmarks. We also demonstrate the performance of X-MEN in mimicking trajectories from human participants as they navigate around obstacles and follow certain constraints on the floor.
%To obtain final trajectories, X-MEN first draws trajectories from the proposal distribution, and then \xyx{ATTENTION, not correct after revision}: re-samples from this trajectory pool according to the importance weights. 
We compare against classic Max-Ent IRL \citep{ziebart2008maximum}, RE-IRL \citep{boularias2011relative}, and the recently proposed maximum likelihood constraint inference (MLCI) \citep{scobee2019maximum}, which can mask out the ``not to go'' states in the transition distribution. We implement X-MEN using IBM ILOG CPLEX Optimizer 12.63 for queries to NP oracles, and the XOR-Sampling parameters are the same as in \cite{fan2021xorcd}. Experiments are carried out on a cluster, where each node has 24 cores and 96GB of memory.

% \subsection{Objectworld benchmark}
% \url{https://github.com/MatthewJA/Inverse-Reinforcement-Learning/blob/master/examples/maxent_objectworld.py}\\
% \\
% \url{https://github.com/yrlu/irl-imitation}

% \subsection{Binaryworld benchmark}
% as above.

\subsection{Grid World}
We consider a $9\times 9$ grid world. The state corresponds to the location of the agent on the grid. The agent has three actions: moving up, moving right, or moving diagonally to the upper right by one cell. The objective is to move from the starting state $s_0$ in the bottom-left corner to the goal state $s_G$ in the upper-right corner. Every state-action pair produces a distance feature, and the cumulative reward is inversely proportional to distance, which encourages short trajectories. There are three additional types of constraints, denoted by the red symbols shown in Figure \ref{fig:reward}. The red triangle denotes the state that must be passed through first among all the symbols, red crosses denote the states that can never be passed through, and the agent must pass through only one red square and one red circle. The demonstration trajectories satisfy all the constraints and have an inductive bias: $70\%$ of the trajectories move along the upper paths and $30\%$ move along the lower paths.

Due to the presence of hard constraints, recovering the reward map cannot be considered the sole performance metric for a learning algorithm.
%
In fact, an IRL agent with the ground truth reward map may produce sub-optimal actions if it violates constraints. 
%
Therefore, we show in Figures \ref{fig:gt}--\ref{fig:mlci} the marginal distributions of passing through each grid cell, obtained by aggregating 100 trajectories produced by each learning algorithm and by the ground truth demonstrations. The distribution of trajectories from X-MEN matches the demonstrations most closely. Neither Maxent IRL nor RE-IRL can handle constraints. While MLCI knows ``where not to go'', it has difficulty in knowing ``where must go'', as the probability of the state marked with the triangle is not 1 (we constrain the agent to go through the triangle). Figure \ref{fig:structure} further reports the percentage of valid trajectories generated by different algorithms when varying the number of demonstration trajectories (Figure \ref{fig:sample2sample}) and the training time (Figure \ref{fig:sample2time}). X-MEN always generates $100\%$ valid trajectories, while no more than $50\%$ of the trajectories from the competing methods are valid. Moreover, the trend shows that even if we keep increasing the number of demonstrations and the training time, the improvement in baseline performance is minimal.
%
Figure \ref{fig:distribution} compares the recovered distributions of the trajectories, where X-MEN has the smallest KL divergence, 0.005, from the ground truth distribution of the demonstrations. The other baselines produce trajectories that differ significantly from the demonstrations (with larger KL divergence). 
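
For reference, the KL divergence between two distributions over the two path types can be computed as below. This is an illustrative sketch: the direction of the divergence, the use of the natural logarithm, and the generated split $q=(0.65,0.35)$ are assumptions chosen for the example, not values reported in the experiments; only the demonstration split $p=(0.7,0.3)$ comes from the setup above.
\begin{align*}
KL(p\,||\,q) &= \sum_{i\in\{\text{upper},\text{lower}\}}p_i\ln\frac{p_i}{q_i}\\
&= 0.7\ln\frac{0.7}{0.65}+0.3\ln\frac{0.3}{0.35}\approx 0.0056 .
\end{align*}
Under these assumptions, a KL divergence on the order of $0.005$ corresponds to a generated split within a few percentage points of the demonstrated $70\%/30\%$ split.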


\subsection{Human Obstacle Avoidance}
\begin{figure}[t]
\centering
\includegraphics[width=0.7\linewidth]{figs/obstacle.pdf}
% \vspace*{-0.3cm}
\caption{Overlaid trajectories generated by X-MEN after learning human preferences. The goal is to move from $S_0$ to $S_G$, and the action space contains only going up and going right. The shaded regions represent obstacles in the human's environment, and the red circle represents a ``must pass'' point. An additional constraint is that a human cannot take the same action more than 3 times consecutively. The trajectories generated by X-MEN satisfy all the constraints and follow the shortest possible paths, similar to the human demonstrators' actions.}\label{fig:obstacle}
%\vspace*{-0.3cm}
\end{figure}

In our second example, we analyze trajectories from human beings as they navigate around obstacles on the floor and follow certain constraints. We map these continuous trajectories onto a grid world where each cell represents a 1ft-by-1ft area on the ground. The state corresponds to the location of the agent in the grid. The human agents attempt to reach a fixed goal state $S_G$ from a given initial state $S_0$, as shown in Figure \ref{fig:obstacle}. 
%
The agent has only two actions: moving up or moving right. The shaded regions represent obstacles in the human's environment that cannot be passed through, and the red circle represents a ``must pass'' choke point that every person has to walk through. An additional hard constraint is that a human cannot take the same action more than 3 times consecutively.

Demonstrations were collected from 10 volunteers, who aimed to move from the start state to the goal state without violating any constraints. 
%
Empirical observations reveal that volunteers tend to follow the shortest paths given these constraints. We train both our model and the competing approaches on these demonstrations with the same training time of $4$ hours, using 16 trajectory samples in each SGD iteration. Generated trajectories from X-MEN are shown in Figure \ref{fig:obstacle}, where we can see that X-MEN successfully avoids obstacles and passes the ``must pass'' choke point. The 10 generated trajectories shown in the figure are indeed shortest paths from the start state to the goal (matching the human demonstrations). Competing approaches do not generate trajectories that satisfy the constraints, while the trajectories generated by X-MEN are $100\%$ valid. It is worth noting that X-MEN learns to go up first before passing through the gap between the two obstacles, because otherwise the trajectory would have to violate the constraint against taking the same action more than 3 times consecutively.



%\input{tex/conclusion}
\section{CONCLUSION}
We proposed X-MEN, a novel XOR maximum entropy framework for constrained Inverse Reinforcement Learning.
%
We showed theoretically that X-MEN converges at a linear rate towards the global optimum of the likelihood function for solving IRL problems. Empirically, we demonstrated the superior performance of X-MEN on two navigation tasks with additional hard combinatorial constraints. 
%
In all tasks, X-MEN generates 100\% valid samples and the generated trajectories closely match the distribution of the training set. 
%
%Despite its benefits, one drawback of our approach is that
%the formulation is based on assuming the knowledge of transition probability, thus limits the broader impact of our method towards more real-world applications.
%
For future work, we would like to extend X-MEN to model-free reinforcement learning while preserving the theoretical guarantees. We also intend to test richer representations of the reward function in the form of deep networks on real-world, large-scale constrained IRL tasks.


%\clearpage
%\begin{contributions} % will be removed in pdf for initial submission,
                      % so you can already fill it to test with the
                      % ‘accepted’ class option
%    Briefly list author contributions.
%    This is a nice way of making clear who did what and to give proper credit.

%    H.~Q.~Bovik conceived the idea and wrote the paper.
%    Coauthor One created the code.
%    Coauthor Two created the figures.
%\end{contributions}

\begin{acknowledgements} % will be removed 
%in pdf for initial submission,
This research is supported by NSF grant CCF-1918327.
                         % so you can already fill it to test with the
                         % ‘accepted’ class option
%    Briefly acknowledge people and organizations here.

%    \emph{All} acknowledgements go in this section.
\end{acknowledgements}
%\clearpage
\bibliography{fan}

%\appendix
% NOTE: necessary when ptmx or no mathfont class option is given
\end{document}
