Abstract: The relationship between inverse reinforcement learning (IRL) and inverse optimization (IO) for Markov decision processes (MDPs) has been relatively underexplored in the literature, despite the two addressing the same problem. In this work, we revisit the relationship between the IO framework for MDPs, IRL, and apprenticeship learning (AL). We incorporate prior beliefs on the structure of the cost function into the IRL and AL problems, and demonstrate that the convex-analytic view of the AL formalism (Kamoutsi et al., 2021) emerges as a relaxation of our framework. Notably, the AL formalism is a special case of our framework when the regularization term is absent. Focusing on the suboptimal expert setting, we formulate the AL problem as a regularized min-max problem. The regularizer plays a key role in addressing the ill-posedness of IRL by guiding the search for plausible cost functions. To solve the resulting regularized convex-concave min-max problem, we use stochastic mirror descent (SMD) and establish convergence bounds for the proposed method. Numerical experiments highlight the critical role of regularization in learning cost vectors and apprentice policies.
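To make the algorithmic idea concrete, the following is a minimal sketch (not the paper's algorithm) of stochastic mirror descent-ascent on a regularized bilinear saddle-point problem of the kind the abstract describes: a min over occupancy measures on the simplex, a max over bounded cost vectors, and a quadratic regularizer that pulls the cost toward a prior guess. All symbols here (mu_E, c_hat, lam, the step sizes, and the choice of regularizer) are illustrative assumptions, not quantities taken from the manuscript.

```python
# Sketch: stochastic mirror descent-ascent for
#   min_{mu in simplex}  max_{c in [-1,1]^n}  c^T (mu - mu_E) - lam * ||c - c_hat||^2 / 2
# mimicking the structure of a regularized apprenticeship-learning objective.
import numpy as np

rng = np.random.default_rng(0)
n = 50                            # toy number of state-action pairs
mu_E = rng.dirichlet(np.ones(n))  # expert occupancy measure (assumed given)
c_hat = rng.normal(size=n)        # prior belief on the cost vector (illustrative)
lam = 0.1                         # regularization weight (illustrative)

mu = np.ones(n) / n               # apprentice occupancy measure, on the simplex
c = np.zeros(n)                   # cost vector, kept in the box [-1, 1]^n
T, eta_mu, eta_c = 5000, 0.5, 0.5
mu_avg, c_avg = np.zeros(n), np.zeros(n)

for t in range(T):
    # Stochastic gradient in mu: sample one coordinate uniformly and rescale,
    # giving an unbiased estimate of the full gradient c.
    i = rng.integers(n)
    g_mu = np.zeros(n)
    g_mu[i] = n * c[i]
    # Ascent direction in c for the regularized objective.
    g_c = (mu - mu_E) - lam * (c - c_hat)
    # Entropic mirror step for mu (multiplicative weights + normalization).
    mu = mu * np.exp(-eta_mu / np.sqrt(t + 1) * g_mu)
    mu /= mu.sum()
    # Euclidean mirror step for c, projected back onto the box.
    c = np.clip(c + eta_c / np.sqrt(t + 1) * g_c, -1.0, 1.0)
    mu_avg += mu
    c_avg += c

mu_avg /= T
c_avg /= T                         # averaged iterates, as in standard SMD analyses
print("||mu_avg - mu_E||_1 =", np.abs(mu_avg - mu_E).sum())
```

With the regularizer active (lam > 0), the averaged cost iterate stays close to the prior c_hat while still separating the apprentice and expert occupancy measures; setting lam = 0 recovers the unregularized bilinear AL-style game.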
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: In response to the reviewer’s feedback, we made the following revisions to the manuscript:
$\textbf{Clarified problem setting and objective:}$ We clarified the assumptions on the generative-model oracle and the specific problem we solve in Section 2.5. We also instantiated problem $\text{IRL-IO}_{\hat{\mathbf{c}}}$ for the case of an optimal expert and $\text{IO-AL}_{\alpha}$ for a suboptimal expert; note that our focus is on the suboptimal-expert setting. In addition, we provided intuitive graphical explanations of each problem in Figures 1 and 2.
$\textbf{Claims and evidence:}$ We added Proposition 3, which characterizes the optimal solution of $\text{IO-AL}_{\alpha}$, and Proposition 4, which relates an expected $\epsilon$-approximate solution (generated by Algorithm 1) to the optimal solution of $\text{IO-AL}_{\alpha}$. Moreover, we clarified that the guarantee of Theorem 2 holds in expectation rather than with high probability.
$\textbf{Updated literature review:}$ We expanded the introduction with related work on inverse reinforcement learning with generative-model oracles for the transition dynamics.
$\textbf{Experiments:}$ We redesigned the experiments for the $\text{IO-AL}_{\alpha}$ problem and conducted new experiments with a suboptimal expert instead of an optimal one.
$\textbf{Typos + notation:}$ We corrected typos and added explanations for notation that was unclear in the earlier version of the manuscript.
Assigned Action Editor: ~Alberto_Maria_Metelli2
Submission Number: 4378