Rethinking ValueDice - Does It Really Improve Performance?

This post rethinks the ValueDice algorithm introduced in the following ICLR publication. We present several new conclusions, some of which may provide new insights.

Kostrikov, Ilya, Ofir Nachum, and Jonathan Tompson. "Imitation learning via off-policy distribution matching." ICLR 2020.

Overview

This blog is organized in the following way.

  1. Why ValueDice?
  2. Background
  3. Rethink ValueDice in the Offline Setting
    1. Connection Between ValueDice and BC
    2. Overfitting of ValueDice and BC
    3. Discussion: Why Does BC Perform Well?
  4. Rethink ValueDice in the Online Setting
  5. Conclusion

This blog may take 10~15 minutes to read.

Why ValueDice?

Many practical applications involve sequential decision-making. For these applications, an agent implements a policy to select actions and maximize the long-term return. Imitation learning approaches aim to obtain the optimal policy from expert demonstrations [Argall et al., 2009, Hussein et al., 2017, Osa et al., 2018]. Imitation learning has been successfully applied in games [Ross et al., 2011, Silver et al., 2016], recommendation systems [Shi et al., 2019, Chen et al., 2019], and robotics [Levine et al., 2016, Finn et al., 2016], among other domains.

One of the milestones in imitation learning is the introduction of generative adversarial imitation learning (GAIL) [Ho and Ermon, 2016]. Different from the classical approach of Behavioral Cloning (BC), which trains a policy via supervised learning, GAIL performs state-action distribution matching in an adversarial manner. The amazing part about GAIL is that even when expert demonstrations are scarce, GAIL is empirically shown to match the expert performance while BC cannot. However, the price to pay is that GAIL requires a large number of environment interactions (e.g., 3M for Gym MuJoCo tasks), which restricts GAIL to the online setting. In contrast, BC does not require any interaction, so BC is widely applied in the offline setting.

The remarkable achievements of GAIL have attracted much research interest in adversarial imitation learning (AIL) approaches. Among these methods, ValueDice [Kostrikov et al., 2020] stands out with the following two improvements.

  1. As an off-policy algorithm, ValueDice is empirically shown to beat BC in the offline setting. In contrast, previous AIL algorithms (e.g., GAIL) that perform state-action distribution matching do not work at all in the offline setting.
  2. ValueDice outperforms another off-policy AIL algorithm, DAC [Kostrikov et al., 2019], in terms of interaction efficiency. Specifically, for many Gym MuJoCo tasks, ValueDice requires only about 50K interaction steps (vs. 3M for GAIL).

All existing results suggest that ValueDice is perfect. This motivates the central question of this blog post: do these improvements come from more advanced algorithm designs? Before answering this question, we have to carefully examine the achievements of ValueDice.

  1. For the comparison with BC, the empirical advantage of ValueDice contradicts the recent lower bound showing that BC is (nearly) minimax optimal in the offline setting [Rajaraman et al., 2020, Xu et al., 2021a]. In other words, this theoretical result suggests that no algorithm can beat BC in the offline setting.
  2. For the online part, it is well accepted that optimization of Bellman backup objectives with function approximation can easily diverge (see the counter-example in [Baird, 1995]; please also refer to [Mnih et al., 2015, Zhang et al., 2021, Agarwal et al., 2021]). However, unlike DAC, ValueDice successfully performs off-policy optimization without the target network technique.

In this blog post, we will provide explanations for these ‘‘contradictions’’ as we answer the research question raised above.


Background

To begin with, this section reviews the basics of Markov Decision Processes, Imitation Learning, Behavioral Cloning, and ValueDice. We suggest going through these contents quickly even if you are already familiar with the covered materials.

Markov Decision Process Following the setup in [Kostrikov et al., 2020], we consider the infinite-horizon Markov decision process (MDP) [Puterman, 2014], which can be described by the tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, p, r, p_0, \gamma)$. Here $\mathcal{S}$ and $\mathcal{A}$ are the state and action space, respectively. $p_0(\cdot)$ specifies the initial state distribution while $\gamma \in [0, 1)$ is the discount factor. For a specific state-action pair $(s, a)$, $p(s' \mid s, a)$ defines the transition probability of the next state $s'$ and $r(s, a) \in [0, 1]$ assigns a reward signal.

To interact with the environment (i.e., an MDP), a stationary policy $\pi: \mathcal{S} \rightarrow \Delta(\mathcal{A})$ is introduced, where $\Delta(\mathcal{A})$ is the set of probability distributions over the action space $\mathcal{A}$. Specifically, $\pi(a \mid s)$ determines the probability of selecting action $a$ at state $s$. The performance of policy $\pi$ is measured by the policy value $V(\pi)$, which is defined as the expected cumulative discounted reward. That is, $$\begin{align*} V(\pi) := \mathbb{E}_{\pi} \left[ \sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t) \mid s_0 \sim p_0(\cdot), a_t \sim \pi(\cdot|s_t), s_{t+1} \sim p(\cdot|s_t, a_t) \right]. \end{align*}$$ To facilitate later analysis, we need to introduce the (discounted) state-action distribution $d^{\pi}(s, a)$: $$ \begin{align*} d^{\pi}(s, a) := (1-\gamma) \sum_{t=0}^{\infty} \gamma^{t}\mathbb{P} \left( s_t = s, a_t = a \mid s_0 \sim p_0(\cdot), a_t \sim \pi(\cdot \mid s_t), s_{t+1} \sim p(\cdot \mid s_t, a_t) \right). \end{align*}$$ With a slight abuse of notation, we can define the (discounted) state distribution $d^{\pi}(s)$: $$\begin{align*} d^{\pi}(s) := (1-\gamma) \sum_{t=0}^{\infty} \gamma^{t} \mathbb{P} \left( s_t = s \mid s_0 \sim p_0(\cdot), a_t \sim \pi(\cdot \mid s_t), s_{t+1} \sim p(\cdot \mid s_t, a_t) \right). \end{align*}$$
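To make the definition of $d^{\pi}$ concrete, here is a small numerical sketch for tabular MDPs (the array layout and function name are our own illustrative choices, not from the paper):

```python
import numpy as np

def discounted_state_action_distribution(P, pi, p0, gamma, tol=1e-10):
    """Compute d^pi(s, a) for a tabular MDP (illustrative sketch).

    P:  transition tensor, shape (S, A, S), P[s, a, s'] = p(s' | s, a)
    pi: policy matrix, shape (S, A), pi[s, a] = pi(a | s)
    p0: initial state distribution, shape (S,)
    gamma: discount factor in [0, 1)
    """
    d_s = np.zeros_like(p0, dtype=float)  # accumulates (1 - gamma) * sum_t gamma^t P(s_t = s)
    p_t = p0.astype(float)                # state distribution at time step t
    weight = 1.0 - gamma
    while weight > tol:
        d_s += weight * p_t
        # one-step forward: P(s_{t+1} = s') = sum_{s, a} P(s_t = s) pi(a | s) p(s' | s, a)
        p_t = np.einsum("s,sa,sap->p", p_t, pi, P)
        weight *= gamma
    return d_s[:, None] * pi              # d^pi(s, a) = d^pi(s) * pi(a | s)
```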
Imitation Learning The goal of imitation learning is to learn a high-quality policy from expert demonstrations. To this end, we often assume there is a nearly optimal expert policy $\pi^{\operatorname{exp}}$ that interacts with the environment to generate a dataset $\mathcal{D} = \{ (s_i, a_i) \}_{i=1}^{m}$. Then, the learner can use the dataset $\mathcal{D}$ to mimic the expert and obtain a good policy. The quality of imitation is measured by the policy value gap $V(\pi^{\operatorname{exp}}) - V(\pi)$, so imitation learning amounts to minimizing this gap: $$\begin{align*} \min_{\pi} V(\pi^{\operatorname{exp}}) - V(\pi). \end{align*}$$ Notice that this policy optimization is performed without the reward signal. In this blog post, we mainly focus on the case where the expert policy is deterministic, an assumption that holds in many applications and is widely adopted in the literature.
Behavioral Cloning Given the state-action pairs provided by the expert, Behavioral Cloning (BC) [Pomerleau, 1991] directly learns a policy mapping to minimize the policy value gap. Specifically, BC often trains a parameterized classifier or regressor with the maximum likelihood estimation: \begin{align} \tag{1} \label{eq:bc_loss} \min_{\theta} \sum_{(s, a) \in \mathcal{D}} -\log \pi_{\theta}(a \mid s), \end{align} which can be viewed as the minimization of the KL-divergence between the expert policy and the learner's policy [Ke et al., 2019, Ghasemipour et al., 2019, Xu et al., 2020]: \begin{align*} \min_{\pi} \mathbb{E}_{s \sim d^{\operatorname{exp}}(\cdot)} \left[ D_{\operatorname{KL}} (\pi^{\operatorname{exp}}(\cdot \mid s) || \pi(\cdot|s) ) \right], \end{align*} where $D_{\operatorname{KL}}(p||q) = \sum_{x \in \mathcal{X}} p(x) \log (p(x)/q(x))$ for two distributions $p$ and $q$ on the common support $\mathcal{X}$.

Under the tabular setting where the state space and action space are finite, the optimal solution to objective \eqref{eq:bc_loss} takes a simple form: \begin{align} \tag{2} \label{eq:bc_count} \pi^{\star}(a|s) = \left\{ \begin{array}{cc} \frac{ \sum_{(s^\prime, a^\prime) \in \mathcal{D}} \mathbb{I}(s^\prime = s, a^\prime = a)}{\sum_{(s^\prime, a^\prime) \in \mathcal{D}} \mathbb{I}(s^\prime = s)} & \quad \text{if} \,\, \sum_{(s^\prime, a^\prime) \in \mathcal{D}} \mathbb{I}(s^\prime = s) > 0 \\ \frac{1}{|\mathcal{A}|} & \quad \text{otherwise} \end{array} \right. \end{align} That is, BC estimates the expert policy's empirical action distribution on the states visited in the expert demonstrations (and defaults to a uniform distribution elsewhere).
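For concreteness, the counting estimator in \eqref{eq:bc_count} can be written in a few lines of Python (an illustrative sketch; the data format and function names are our assumptions):

```python
from collections import Counter

def bc_tabular_policy(demos, actions):
    """Tabular BC estimator of eq. (2): empirical expert action frequencies on
    visited states, and a uniform distribution on non-visited states.
    `demos` is a list of (state, action) pairs (illustrative data format)."""
    state_counts = Counter(s for s, _ in demos)
    pair_counts = Counter(demos)

    def pi(a, s):
        if state_counts[s] == 0:          # non-visited state: default to uniform
            return 1.0 / len(actions)
        return pair_counts[(s, a)] / state_counts[s]

    return pi

# usage: pi = bc_tabular_policy([("s0", "left"), ("s0", "left")], ["left", "right"])
#        pi("left", "s0") -> 1.0, while pi("left", "s1") -> 0.5 (non-visited state)
```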
ValueDice With state-action distribution matching, ValueDice [Kostrikov et al., 2020] implements the objective: \begin{align} \tag{3} \max_{\pi} -D_{\operatorname{KL}} ( d^{\pi} || d^{\operatorname{exp}} ) &:= \mathbb{E}_{(s, a) \sim d^{\pi}} \left[ \log \frac{d^{\operatorname{exp}} (s, a)}{d^{\pi} (s, a)} \right] \label{eq:state_action_kl_matching_1} \\ \tag{4} &= (1-\gamma) \, \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} \log \frac{d^{\operatorname{exp}} (s_t, a_t)}{d^{\pi} (s_t, a_t)} \mid s_0 \sim p_0(\cdot), a_t \sim \pi(\cdot|s_t), s_{t+1} \sim p(\cdot|s_t, a_t) \right] . \label{eq:state_action_kl_matching_2} \end{align} That is, \eqref{eq:state_action_kl_matching_2} amounts to a max-return RL problem with the reward $\widetilde{r}(s, a) = \log \left( d^{\operatorname{exp}} (s, a) / d^{\pi}(s, a) \right)$. In ValueDice, a dual form of \eqref{eq:state_action_kl_matching_2} is presented: \begin{align*} -D_{\operatorname{KL}} ( d^{\pi} || d^{\operatorname{exp}} ) = \min_{x: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}} \left[ \log \mathbb{E}_{(s, a) \sim d^{\operatorname{exp}}} [e^{x(s, a)}] - \mathbb{E}_{(s, a) \sim d^{\pi}}[x(s, a)] \right]. \end{align*} With a change of variables, Kostrikov et al. [2020] derived the following optimization problem: \begin{align} \tag{5} \label{eq:value_dice_objective} \max_{\pi} \min_{\nu: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R} } J_{\operatorname{dice}} (\pi, \nu):= \log \left( \mathbb{E}_{(s, a) \sim d^{\operatorname{exp}}}[e^{\nu(s, a) - \mathcal{B}^\pi \nu(s, a)}] \right) - (1-\gamma) \cdot \mathbb{E}_{s_0 \sim p_0(\cdot), a_0 \sim \pi(\cdot|s_0)}[\nu(s_0, a_0)], \end{align} where $\mathcal{B}^{\pi}\nu(s, a)$ performs a one-step Bellman update (with zero reward): \begin{align*} \mathcal{B}^{\pi}\nu(s, a) = \gamma \mathbb{E}_{s^\prime \sim p(\cdot \mid s, a), a^\prime \sim \pi(\cdot |s^\prime)}\left[ \nu(s^\prime, a^\prime) \right]. \end{align*} As noted in [Kostrikov et al., 2020], objective \eqref{eq:value_dice_objective} only involves samples from the expert demonstrations for its updates, which may lack diversity and hamper training. To address this issue, Kostrikov et al. [2020] proposed to use samples from the replay buffer in the online setting. Concretely, an alternative objective is introduced: \begin{align*} J_{\operatorname{dice}}^{\operatorname{mix}} &:= \log \left( \mathbb{E}_{(s, a) \sim d^{\operatorname{mix}}}[e^{\nu(s, a) - \mathcal{B}^\pi \nu(s, a)}] \right) - (1-\alpha) (1-\gamma) \cdot \mathbb{E}_{s_0 \sim p_0(\cdot), a_0 \sim \pi(\cdot|s_0)}[\nu(s_0, a_0)] \\ &\quad - \alpha \mathbb{E}_{(s, a) \sim d^{\operatorname{RB}}}\left[ \nu(s, a) - \mathcal{B}^{\pi} \nu(s, a) \right], \end{align*} where $d^{\operatorname{mix}}(s, a) = (1-\alpha) d^{\operatorname{exp}}(s, a) + \alpha d^{\operatorname{RB}}(s, a)$. In this case, ValueDice performs a modified state-action distribution matching: \begin{align*} \max_{\pi}\quad -D_{\operatorname{KL}} ( (1-\alpha) d^{\pi} + \alpha d^{\operatorname{RB}} || (1-\alpha) d^{\operatorname{exp}} + \alpha d^{\operatorname{RB}} ). \end{align*} As long as $\alpha \in [0, 1)$, the global optimality of $\pi = \pi^{\operatorname{exp}}$ is preserved. In the offline setting, we set $\alpha = 0$.
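To make the offline instantiation of objective \eqref{eq:value_dice_objective} (i.e., $\alpha = 0$) concrete, here is a minimal PyTorch-style sketch. The network interfaces (`nu_net`, `policy`) and the single-sample approximation of the inner expectations are our assumptions; the official implementation differs in several details (e.g., gradient penalties and actor/critic update rules):

```python
import torch

def offline_value_dice_loss(nu_net, policy, expert_s, expert_a, expert_next_s,
                            init_s, gamma):
    """Sketch of J_dice in eq. (5) with alpha = 0 (offline setting).

    Assumed interfaces: nu_net(s, a) returns a tensor of shape (batch,), and
    policy(s) returns a (reparameterized) action sample. The max over the
    policy and min over nu are realized by ascent on -J and descent on J.
    """
    # nu(s, a) - B^pi nu(s, a) on expert transitions; the inner expectation
    # over a' ~ pi(.|s') is replaced by a single sample
    next_a = policy(expert_next_s)
    diff = nu_net(expert_s, expert_a) - gamma * nu_net(expert_next_s, next_a)

    # log E_exp[ exp(nu - B^pi nu) ], estimated as a log-mean-exp over the batch
    batch_size = torch.tensor(float(diff.shape[0]))
    first_term = torch.logsumexp(diff, dim=0) - torch.log(batch_size)

    # (1 - gamma) * E_{s0 ~ p0, a0 ~ pi}[ nu(s0, a0) ]
    init_a = policy(init_s)
    second_term = (1.0 - gamma) * nu_net(init_s, init_a).mean()

    return first_term - second_term
```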


Rethink ValueDice in the Offline Setting

In this section, we revisit ValueDice in the offline setting, where only the expert dataset is provided and environment interaction is not allowed. In this case, the simple BC algorithm can be directly applied, while on-policy adversarial imitation learning (AIL) methods (e.g., GAIL) cannot, because the latter often require environment interactions to estimate the state-action distribution $d^{\pi}(s, a)$ during optimization.

In the offline setting, classical theory suggests that BC suffers from the compounding errors issue [Ross et al., 2011, Xu et al., 2020]. Specifically, once the agent makes a wrong decision (i.e., an imperfect imitation due to finite samples), it may visit an unseen state and make another wrong decision there. In the worst case, such decision errors accumulate over $H$ time steps and the agent could obtain zero reward along such a bad trajectory. This explanation coincides with the empirical observation that BC performs poorly when expert demonstrations are scarce. Thus, it would be interesting and valuable if some algorithm could perform better than BC in the offline setting[1]. Unfortunately, the information-theoretic lower bound [Rajaraman et al., 2020, Xu et al., 2021a] suggests that BC is (nearly) minimax optimal in the offline setting.

Theorem 1 (Lower Bound [Xu et al., 2021a]) Given an expert dataset $\mathcal{D} = \{ (s_i, a_i) \}_{i=1}^m$ drawn i.i.d. from $d^{\operatorname{exp}}$, for any offline algorithm $\mathrm{Alg}: \mathcal{D} \rightarrow \widehat{\pi}$, there exist a constant $\delta_0 \in (0, 1/10]$, an MDP $\mathcal{M}$, and a deterministic expert policy $\pi^{\operatorname{exp}}$ such that, for any $\delta \in (0, \delta_0)$, with probability at least $1-\delta$, the policy value gap has the lower bound:

\[\begin{align*} V({\pi^{\operatorname{exp}}}) - V({\widehat{\pi}}) \succsim \min \left( \frac{1}{1-\gamma}, \frac{\vert \mathcal{S} \vert}{(1-\gamma)^2 m} \right). \end{align*}\]

Remark 1: This lower bound implies no algorithm can beat BC in the worst case since the sample complexity of BC is $\widetilde{\mathcal{O}}({ \vert \mathcal{S} \vert } /{((1-\gamma)^2 m)})$ [Xu et al., 2021a], which matches the lower bound up to logarithmic terms. However, we do not know whether (off-policy based) AIL could match this lower bound in the offline setting.

From the empirical side, we realize that Kostrikov et al. [2020] showed that their proposed algorithm ValueDice could be better than BC in the offline setting; see [Kostrikov et al., 2020, Figure 4] or the reproduced results in the following Figure 1[2]. This manifests a gap between theory and practice. In the sequel, we examine the achievement of ValueDice and explain this gap.

Figure 1: Comparison of ValueDice and BC in the offline setting with 1 expert trajectory, which reproduces [Kostrikov et al., 2020, Figure 4]. Dashed lines correspond to the performance of expert policies.

Connection Between ValueDice and BC

As a first step, let us review the training objective of ValueDice in the offline scenario. For ease of presentation, we consider the case where the number of expert trajectories is 1, which follows the experimental setting used in Figure 1. As a result, the training objective of ValueDice with $T$ samples is formulated as

\[\begin{align} \tag{6} \label{eq:value_dice_empirical_surrogate_objective} J_{\operatorname{dice}}^{\operatorname{offline}}(\pi, \nu) &:= \log \left( \sum_{t=1}^{T} \left[ e^{\nu(s_t, a_t) - \gamma \mathbb{E}_{\widetilde{a}_{t+1} \sim \pi(\cdot \mid s_{t+1})} [ \nu(s_{t+1}, \widetilde{a}_{t+1}) ] } \right] \right) - (1-\gamma) \sum_{t=1}^{T} \mathbb{E}_{\widetilde{a}_t \sim \pi (\cdot \mid s_t) }[\nu(s_t, \widetilde{a}_t)], \end{align}\]

Remark 2: Even though it is derived from the principle of state-action distribution matching (i.e., objective \eqref{eq:state_action_kl_matching_1}), ValueDice cannot enjoy the benefits of state-action distribution matching in the offline setting, as explained in [Xu et al., 2021b, Section 6]. This is because objective \eqref{eq:value_dice_empirical_surrogate_objective} only involves states collected by the expert policy. That is, there is no guidance for policy optimization on non-visited states. This is the main difference between the online setting and the offline setting. As a result, we cannot explain the experiment results in Figure 1 by the tailored theory for online state-action distribution matching in [Xu et al., 2021b].

Benefits of Online State-Action Distribution Matching (Optional) According to the mathematical models in [Rajaraman et al., 2020], BC has two features: it performs perfectly on visited states in the expert demonstrations by duplicating the expert action; however, it may make mistakes on non-visited states since BC is restricted to performing optimization on visited states. In contrast, state-action distribution matching methods (i.e., AIL methods) can perform policy optimization on non-visited states via online interaction. Under mild assumptions on the MDP, AIL could select ''good'' actions on non-visited states and therefore enjoy better performance [Xu et al., 2021b]. $$\tag*{$\blacksquare$}$$

To obtain some intuition about the optimal solutions to the empirical objective \eqref{eq:value_dice_empirical_surrogate_objective}, we can further consider the case where $\gamma = 0$. Note that even this variant can outperform BC in the offline setting (see Figure 2).

Theorem 2: Consider tabular MDPs. If $\gamma = 0$, objective \eqref{eq:value_dice_empirical_surrogate_objective} reduces to \begin{align} \tag{7} \label{eq:value_dice_empirical_surrogate_objective_gamma_0} J_{\operatorname{dice}}^{\operatorname{offline}, \gamma=0}(\pi, \nu) &:= \log \left( \sum_{t=1}^{T} e^{\nu(s_t, a_t)} \right) - \sum_{t=1}^{T} \mathbb{E}_{\widetilde{a}_t \sim \pi (\cdot \mid s_t) }[ \nu(s_t, \widetilde{a}_t)]. \end{align} Importantly, BC's solution in \eqref{eq:bc_count} is also a globally optimal solution to ValueDice's objective \eqref{eq:value_dice_empirical_surrogate_objective_gamma_0}.

Proof Sketch of Theorem 2 For tabular MDPs, the optimization variables associated with different state-action pairs are independent. As a consequence, for a specific state-action pair $(s_t, a_t)$ that appears in the expert demonstrations, the corresponding optimization objective is \begin{align*} \max_{\pi(\cdot \mid s_t) \in \Delta(\mathcal{A})} \quad \min_{\nu(s_t, a_t) \in \mathbb{R}} \nu(s_t, a_t) + b - \mathbb{E}_{ \widetilde{a}_t \sim \pi (\cdot \mid s_t)}[\nu(s_t, \widetilde{a}_t)], \end{align*} where $b$ is unrelated to the optimization variables. This optimization problem has equilibrium points with $\pi(a_t|s_t) = 1$ and $\pi(a |s_t) = 0$ for any $a \ne a_t$. Hence, BC's ''counting'' solution in \eqref{eq:bc_count} is also a globally optimal solution to the empirical objective \eqref{eq:value_dice_empirical_surrogate_objective_gamma_0}. $$\tag*{$\blacksquare$}$$
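As a sanity check, the per-state objective in the proof sketch can be verified numerically on a toy two-action example. The sketch below sets $b = 0$ and restricts the dual variable $\nu$ to a bounded grid (both simplifications are ours), and it recovers $\pi(a^{\operatorname{expert}} \mid s) = 1$, i.e., the BC solution:

```python
import numpy as np

B = 10.0                                  # box bound on the dual variable nu (our simplification)
nu_grid = np.linspace(-B, B, 201)
pi_grid = np.linspace(0.0, 1.0, 101)      # pi(a_expert | s); the other action gets 1 - pi

inner_min = []
for p in pi_grid:
    # objective (7) restricted to one state, two actions, and b = 0:
    #   nu(s, a_expert) - [ p * nu(s, a_expert) + (1 - p) * nu(s, a_other) ]
    vals = (1.0 - p) * nu_grid[:, None] - (1.0 - p) * nu_grid[None, :]
    inner_min.append(vals.min())

best_pi = pi_grid[np.argmax(inner_min)]
print(f"argmax over pi(a_expert | s): {best_pi:.2f}")   # prints 1.00, the BC solution
```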
Figure 2: Comparison of BC and ValueDice's variants (with different discount factors $\gamma$ and orthogonal regularization coefficients $\lambda$) in the offline setting with 1 expert trajectory. The default configuration for ValueDice is $\gamma=0.99$ and $\lambda = 0.0001$. Dashed lines correspond to the performance of expert policies.

We remark that Theorem 2 provides a restricted case where ValueDice degenerates to BC in the offline setting; see the experimental verification in Figure 3. As we have analyzed in Remark 2, the same intuition is expected to hold when $\gamma > 0$ as ValueDice does not have guidance on non-visited states. However, it is technically challenging to draw this conclusion. Instead, we consider the general state-action distribution matching principle[3]. In particular, we claim that this conclusion still holds for tabular MDPs with finite horizons.

Figure 3: Comparison of BC and ValueDice in terms of MSE (i.e., BC’s loss) in the offline setting with 1 expert trajectory. This result suggests that ValueDice is closely related to BC in the offline setting.

Theorem 3: Consider tabular MDPs with a finite horizon and assume that 1 expert trajectory is provided. In the offline setting, BC's solution is the unique globally optimal solution to $\ell_1$-norm based state-action distribution matching (i.e., $\min_{\pi} \Vert d^{\pi} - \widehat{d^{\operatorname{exp}}} \Vert_1 $), where $\widehat{d^{\operatorname{exp}}}$ is the expert state-action distribution estimated from the expert demonstrations.

Proof Sketch of Theorem 3 For MDPs with finite horizon $H$, consider the following optimization problem: \begin{align*} \min_{\pi} \sum_{t=1}^{H} \sum_{(s, a) \in \mathcal{S} \times \mathcal{A}} \, \, &\left\vert d^{\pi}_t(s, a) - \widehat{d^{\operatorname{exp}}_t}(s, a) \right\vert, \\ \text{such that} \,\, & \pi_t(a \mid s) = 1/|\mathcal{A}|, \forall s, t \text{ with } \widehat{d^{\operatorname{exp}}_t}(s) = 0. \end{align*} In particular, the objective corresponds to $\ell_1$-norm based state-action distribution matching and the constraints correspond to offline policy optimization (i.e., the constraints specify a (default) uniform action distribution on non-visited states).
To facilitate later analysis, we will use $a^{\operatorname{expert}}$ to denote the expert action at a specific state $s$ (recall that we assume the expert policy is deterministic). Let $\widetilde{s}$ denote the state visited by the single expert trajectory at the time step under consideration.
We want to mimic the classical dynamic programming proof as in [Bertsekas, 2012]. Specifically, we aim to prove that BC's policy is optimal in a backward manner. To this end, we define the cost function at time step $h$ as \begin{align*} \sum_{(s, a) \in \mathcal{S} \times \mathcal{A}} \left\vert d^{\pi}_h(s, a) - \widehat{d^{\operatorname{exp}}_h}(s, a) \right\vert. \end{align*} Then, we can define the cost-to-go function: \begin{align*} \ell_{h} = \sum_{t=h}^{H} \sum_{(s, a) \in \mathcal{S} \times \mathcal{A}} \left\vert d^{\pi}_t(s, a) - \widehat{d^{\operatorname{exp}}_t}(s, a) \right\vert. \end{align*} At the last time step $h=H$, we have that \begin{align*} \ell_{H} &= \sum_{(s, a)} \left\vert d^{\pi}_H(s, a) - \widehat{d^{\operatorname{exp}}_H}(s, a) \right\vert \\ &= \sum_{a} \left\vert d^{\pi}_{H}(\widetilde{s}, a) - \widehat{d^{\operatorname{exp}}_H}(\widetilde{s}, a) \right\vert + \sum_{s \ne \widetilde{s}} \sum_{a} \left\vert d^{\pi}_H(s, a) - \widehat{d^{\operatorname{exp}}_H}(s, a) \right\vert \\ &= \left\vert d^{\pi}_{H}(\widetilde{s}, a^{\operatorname{expert}}) - 1 \right\vert + \sum_{a \ne a^{\operatorname{expert}}} \left\vert d^{\pi}_H(\widetilde{s}, a) - 0 \right\vert + \sum_{s \ne \widetilde{s}} \sum_{a} \left\vert d^{\pi}_H(s, a) - 0 \right\vert \\ &= \left\vert d^{\pi}_H(\widetilde{s}) \pi_{H}(a^{\operatorname{expert}}| \widetilde{s}) - 1 \right\vert + d^{\pi}_H(\widetilde{s}) ( 1 - \pi_H(a^{\operatorname{expert}} | \widetilde{s})) + \sum_{s \ne \widetilde{s}} \sum_{a} d^{\pi}_H(s) \frac{1}{|\mathcal{A}|} \\ &= 1 + d^{\pi}_H(\widetilde{s}) - 2 d^{\pi}_H(\widetilde{s}) \pi_{H}(a^{\operatorname{expert}}|\widetilde{s}) + \sum_{s \ne \widetilde{s}} d^{\pi}_H(s) \\ &= 1 + d^{\pi}_H(\widetilde{s}) - 2 d^{\pi}_H(\widetilde{s}) \pi_{H}(a^{\operatorname{expert}}|\widetilde{s}) + (1 - d^{\pi}_H(\widetilde{s})) \\ &= 2 \left( 1 - d^{\pi}_H(\widetilde{s}) \pi_H(a^{\operatorname{expert}} | \widetilde{s}) \right), \end{align*} which has a globally optimal solution at $\pi_H(a^{\operatorname{expert}}| \widetilde{s}) = 1$. Then, we have that $\ell_H^{\star} = 2 \left( 1 - d^{\pi}_H(\widetilde{s}) \right)$. We have the following decomposition for $d^{\pi}_H(\widetilde{s})$: \begin{align*} d^{\pi}_H(\widetilde{s}) & = \sum_{(s^\prime, a^\prime)} p_{H-1}(\widetilde{s}|s^\prime, a^\prime) d_{H-1}^{\pi}(s^\prime, a^\prime) \\ &= \sum_{a^\prime }p_{H-1}(\widetilde{s} | \widetilde{s}, a^{\prime}) d^{\pi}_{H-1}(\widetilde{s}, a^\prime) + \sum_{s^\prime \ne \widetilde{s}} \sum_{a^\prime} p_{H-1}(\widetilde{s}|s^\prime, a^\prime) d_{H-1}^{\pi}(s^\prime, a^\prime) \\ &= \sum_{a^\prime }p_{H-1}(\widetilde{s} | \widetilde{s}, a^{\prime}) d^{\pi}_{H-1}(\widetilde{s}) \pi_{H-1}(a^\prime | \widetilde{s}) + \sum_{s^\prime \ne \widetilde{s}} \sum_{a^\prime} p_{H-1}(\widetilde{s}|s^\prime, a^\prime) d^{\pi}_{H-1}(s^\prime) \frac{1}{|\mathcal{A}|} \\ &= p_{H-1}(\widetilde{s} | \widetilde{s}, a^{\operatorname{expert}}) d^{\pi}_{H-1}(\widetilde{s}) \pi_{H-1}(a^{\operatorname{expert}} | \widetilde{s}) + \sum_{a^\prime \ne a^{\operatorname{expert}}} p_{H-1}(\widetilde{s} | \widetilde{s}, a^{\prime}) d^{\pi}_{H-1}(\widetilde{s}) \pi_{H-1}(a^\prime|\widetilde{s}) \\ &\quad + \sum_{s^\prime \ne \widetilde{s}} \sum_{a^\prime} p_{H-1}(\widetilde{s}|s^\prime, a^\prime) d^{\pi}_{H-1}(s^\prime)\frac{1}{|\mathcal{A}|} .
\end{align*} At time step $h = H-1$, we have that \begin{align*} \ell_{H-1}^{\star} &= \min_{\pi_{H-1}} \quad \sum_{(s, a)} \left\vert d^{\pi}_{H-1}(s, a) - \widehat{d^{\operatorname{exp}}_{H-1}}(s, a) \right\vert + \ell_H^{\star} \\ &= \min_{\pi_{H-1}} \quad 2 \left( 1 - d^{\pi}_{H-1}(\widetilde{s}) \pi_{H-1}(a^{\operatorname{expert}} | \widetilde{s}) \right) + \ell_{H}^{\star} . \end{align*} The related optimization problem at time step $h = H -1 $ is \begin{align*} g(\pi_{H-1}) = \left[ - 1 - p_{H-1}(\widetilde{s} | \widetilde{s}, a^{\operatorname{expert}}) \right] \pi_{H-1}(a^{\operatorname{expert}} | \widetilde{s}) - \sum_{a^\prime \ne a^{\operatorname{expert}}} p_{H-1}(\widetilde{s} | \widetilde{s}, a^{\prime}) \pi_{H-1}(a^\prime | \widetilde{s}). \end{align*} It is obvious that the coefficient for the expert action is $-1 - p_{H-1}(\widetilde{s} | \widetilde{s}, a^{\operatorname{expert}})$, which is strictly smaller than the coefficient $-p_{H-1}(\widetilde{s} | \widetilde{s}, a^\prime)$ for any non-expert action. As a result, $\ell_{H-1}$ has the globally optimal solution at $\pi_{H-1}(a^{\operatorname{expert}} | \widetilde{s}) = 1$ (given that the later time steps follow the globally optimal solution). Furthermore, \begin{align*} \ell_{H-1}^{\star} &= 2 \left[ (1 - d^{\pi}_{H-1}(\widetilde{s})) + (1 - d^{\pi}_{H}(\widetilde{s})) \right] \\ &= 2 \left[ \left( 1 - d^{\pi}_{H-1}(\widetilde{s}) \right) + \left( 1 - p_{H-1}(\widetilde{s}|\widetilde{s}, a^{\operatorname{expert}}) d^{\pi}_{H-1}(\widetilde{s}) \right) - \sum_{s^\prime \ne \widetilde{s}} \sum_{a^\prime} p_{H-1}(\widetilde{s}|s^\prime, a^\prime) d_{H-1}^{\pi}(s^\prime, a^\prime) \right] \\ &= 2 \left[ 2 - (1 + p_{H-1}(\widetilde{s}|\widetilde{s}, a^{\operatorname{expert}})) d^{\pi}_{H-1}(\widetilde{s}) - \frac{1}{|\mathcal{A}|} \sum_{s^\prime \ne \widetilde{s}} \sum_{a^\prime} p_{H-1}(\widetilde{s}|s^\prime, a^\prime) d_{H-1}^{\pi}(s^\prime) \right]. \end{align*} Following the above procedure, we can prove the cases for $h = H-2, \cdots, 1$. $$\tag*{$\blacksquare$}$$

Note that the assumption of 1 expert trajectory is mainly to keep the analysis simple, and the $\ell_1$-norm based state-action distribution matching objective is used to ensure that the objective is well-defined[4]. In fact, the experiment in Figure 1 satisfies the assumptions required in Theorem 3. We remark that this conclusion is expected to hold for other AIL methods since the choice of divergence measure is not essential for the asymptotic sample complexity. To support this claim, we note that many divergence measures are closely related to the total variation distance, which corresponds to the $\ell_1$-norm based state-action distribution matching considered here [Ghasemipour et al., 2019, Xu et al., 2020].
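To make the ''closely related'' claim concrete, Pinsker's inequality (together with the identity $\Vert p - q \Vert_1 = 2 D_{\operatorname{TV}}(p, q)$) gives $$\begin{align*} \left\Vert d^{\pi} - d^{\operatorname{exp}} \right\Vert_1 = 2 D_{\operatorname{TV}} ( d^{\pi}, d^{\operatorname{exp}} ) \le \sqrt{2 D_{\operatorname{KL}}(d^{\pi} || d^{\operatorname{exp}})}, \end{align*}$$ so a small KL objective \eqref{eq:state_action_kl_matching_1} also implies a small $\ell_1$ distance of the kind considered in Theorem 3.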

Summary: We believe that in the offline setting, AIL methods cannot perform optimization on non-visited states. As a result, the BC policy suffices to be a globally optimal solution for AIL methods in the offline scenario. We highlight that this insight is not provided by the lower bound argument [Rajaraman et al., 2020, Xu et al., 2021a], which only tells us that one specific algorithm, BC, is nearly minimax optimal in the offline setting. From another viewpoint, our result also suggests that state-action distribution matching is expected to be a ‘‘universal’’ principle: it is closely related to BC in the offline setting while it enjoys many benefits in the online setting.

Overfitting of ValueDice and BC

Based on the above conclusion that AIL may reduce to BC in the offline setting, it remains unclear why ValueDice outperforms BC in Figure 1. If we look at the training curves carefully, we observe that BC performs well in the initial learning phase. As the number of iterations increases, the performance of BC tends to degenerate. This is exactly the overfitting phenomenon.

To overcome overfitting in this low-data regime, we can adopt many regularization techniques such as weight decay [Krogh and Hertz, 1991], dropout [Srivastava et al., 2014], and early stopping [Prechelt, 1996]. For simplicity, let us consider the weight decay technique, i.e., adding an $\ell_2$-norm penalty on the training parameters to the original training objective. The empirical result is shown in Figure 4. Surprisingly, we find that even with a small weight decay, BC generalizes better and performs as well as ValueDice.

Figure 4: Comparison of BC with/without weight decay (i.e., $\ell_2$-norm based regularization) with 1 expert trajectory. Dashed lines correspond to the performance of expert policies.
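For concreteness, the following is a minimal PyTorch-style sketch of BC with weight decay on dummy expert data; the network sizes, learning rate, and penalty coefficient are illustrative assumptions rather than the exact configuration used in our runs:

```python
import torch
from torch import nn, optim

obs_dim, act_dim = 17, 6                       # placeholder dimensions (illustrative)
states = torch.randn(1000, obs_dim)            # stand-in for one expert trajectory
actions = torch.randn(1000, act_dim)

# Deterministic MLP policy trained with the MSE loss; layer sizes and the
# learning rate are illustrative assumptions.
policy = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                       nn.Linear(256, 256), nn.ReLU(),
                       nn.Linear(256, act_dim))

# weight_decay adds the l2-norm penalty on the parameters; a small value is
# already enough to suppress the overfitting seen without regularization.
optimizer = optim.Adam(policy.parameters(), lr=1e-3, weight_decay=1e-4)

for step in range(1000):
    loss = ((policy(states) - actions) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```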

The overfitting perspective motivates us to carefully examine the performance of ValueDice. In particular, we notice that in practice ValueDice uses orthogonal regularization [Brock et al., 2018], a regularizer from the GAN literature. Without this regularizer, the performance of ValueDice is poor, as shown in Figure 2.

Summary: Up to now, we know that ValueDice degenerates to BC in the offline setting, and the experiment results suggest that regularization matters. With suitable regularization, BC matches the expert performance just as ValueDice does.

Discussion: Why Does BC Perform Well?

A natural question about the previous experiment results is why BC (with regularization) performs so well (i.e., BC nearly matches the expert performance) even when provided with only 1 expert trajectory, as in Figure 4. It seems that there are no compounding errors for BC. Furthermore, the presented results seem to contradict the common impression that BC performs poorly while AIL performs well [Ho and Ermon, 2016, Kostrikov et al., 2019]. We discuss this mismatch in this part.

First, the compounding errors argument for BC relies on the assumption that the agent makes a mistake at each time step, as in [Ross et al., 2011]. As such, decision errors accumulate over time steps. However, this assumption may not hold in some cases. Consider MDPs with deterministic transitions: BC's policy does not make any mistake when starting from a visited initial state. As a consequence, there are no compounding errors along such a ‘‘good’’ trajectory. When the initial state is additionally unique, it makes sense that BC works even when only 1 expert trajectory is provided. Formally, the policy value gap of BC is expected to be $\widetilde{\mathcal{O}}(\vert \mathcal{S} \vert H/m)$ for MDPs with deterministic transitions, where $H$ is the finite horizon and $m$ is the number of expert trajectories. Furthermore, the dependence on $\vert \mathcal{S} \vert$ is due to random initial states; the policy value gap becomes smaller if the set of initial states is limited. We highlight that this upper bound is tighter than the general one $\widetilde{\mathcal{O}}( \vert \mathcal{S} \vert H^2/m)$, which is derived for MDPs with stochastic transitions [Rajaraman et al., 2020]. The fundamental difference is that in stochastic MDPs, BC has a positive probability of making a mistake even when starting from a visited initial state, which leads to the quadratic dependence on $H$. For our purpose, we note that MuJoCo locomotion tasks have (almost) deterministic transition functions and their initial states lie within a small range. Therefore, it is expected that BC (with regularization) performs well, as in Figure 4.

Second, the worse dependence on the planning horizon for BC is observed with subsampled trajectories in [Ho and Ermon, 2016, Kostrikov et al., 2019]. More concretely, they sample non-consecutive transition tuples from a complete trajectory as the expert demonstrations. This differs from our experiment setting, in which we use complete trajectories. Subsampled trajectories artificially ‘‘mask’’ some states. As a result, BC is not guaranteed to match the expert trajectory even when the transition function is deterministic and the initial state is unique. This explains why BC performs poorly in [Ho and Ermon, 2016, Kostrikov et al., 2019]. We note that this subsampling operation turns out to be minor for AIL methods (see the discussion in [Xu et al., 2021a, Section 6]). In addition, we acknowledge that existing theory [Xu et al., 2020, 2021b, Rajaraman et al., 2020] cannot perfectly explain the actual performance of BC in the subsampling case because the current theory mainly focuses on the complete-trajectory case.
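To illustrate the effect of subsampling, here is a small sketch in the spirit of the referenced preprocessing: keep every $k$-th transition starting from a random offset (our paraphrase; the exact procedure in the referenced code may differ):

```python
import numpy as np

def subsample_trajectory(trajectory, rate=20, seed=0):
    """Keep every `rate`-th transition starting from a random offset (our
    paraphrase of the referenced preprocessing; exact details may differ)."""
    offset = np.random.default_rng(seed).integers(rate)
    return trajectory[offset::rate]

# A 1000-step trajectory of (s, a) pairs keeps only 50 non-consecutive pairs,
# so BC no longer observes the expert action on most visited states.
demo = [(f"s{t}", f"a{t}") for t in range(1000)]
print(len(subsample_trajectory(demo)))   # -> 50
```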


Rethink ValueDice in the Online Setting

In this section, we consider the online setting, where the agent is allowed to interact with the environment. In particular, ValueDice is empirically shown to outperform another off-policy AIL algorithm, DAC [Kostrikov et al., 2019], in terms of environment interactions; see [Kostrikov et al., 2020, Figure 2] or the reproduced results in the following Figure 5. More surprisingly, ValueDice successfully performs policy optimization without the target network technique. This contradicts the common understanding that optimization of Bellman backup objectives with function approximation can easily diverge (see the counter-example in [Baird, 1995]; please also refer to [Mnih et al., 2015, Zhang et al., 2021, Agarwal et al., 2021]).

Figure 5: Comparison of ValueDice and DAC in the online setting with 1 complete expert trajectory, which reproduces [Kostrikov et al., 2020, Figure 2]. Dashed lines correspond to the performance of expert policies.

We realize that the result in [Kostrikov et al., 2020, Figure 2] is based on the setting where complete expert trajectories are provided. In this case, we have shown that ValueDice could degenerate to BC, and that BC (with regularization) performs well. Furthermore, by comparing the learning curves in Figure 1 (offline) and Figure 5 (online), we find that online interaction does not matter for ValueDice. This helps confirm that the reduction to BC is crucial for ValueDice. Thus, the reason why ValueDice does not diverge in this case is that there is no divergence issue for BC.

Now, how about the case where expert trajectories are incomplete/subsampled? The corresponding results are missing in [Kostrikov et al., 2020], so we provide them in the following Figure 6. In particular, we see that ValueDice fails while DAC still works. One major difference between ValueDice and DAC in policy optimization is that DAC uses the target network technique while ValueDice does not. This suggests that the practical modifications proposed in [Kostrikov et al., 2020, Section 5] are not valid in general.

Figure 6: Comparison of ValueDice and DAC in the online setting with 10 subsampled expert trajectories (the subsampling rate is 20). Dashed lines correspond to the performance of expert policies.
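For readers unfamiliar with the target network technique mentioned above, below is a minimal sketch of the soft (Polyak-averaged) target update that DAC-style critics typically employ and that ValueDice omits; the module, coefficient, and function names are our own illustrative choices:

```python
import copy
import torch

# Stand-in critic; in DAC this would be the Q-network of the off-policy learner.
critic = torch.nn.Linear(23, 1)
target_critic = copy.deepcopy(critic)
tau = 0.005                                # soft-update coefficient (a common choice)

def soft_update(online, target, tau):
    """target <- tau * online + (1 - tau) * target, applied after each critic step."""
    with torch.no_grad():
        for p, p_targ in zip(online.parameters(), target.parameters()):
            p_targ.mul_(1.0 - tau).add_(tau * p)

# Bellman targets are computed with `target_critic`, which changes slowly and
# thereby stabilizes the optimization of Bellman backup objectives.
soft_update(critic, target_critic, tau)
```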


Summary: In summary, our experiment results support the claim that ValueDice cannot successfully perform policy optimization with function approximation in general cases. Moreover, our results suggest that the aforementioned success of ValueDice relies on its connection to BC. Perhaps this implication also holds for the follow-up work SoftDice [Sun et al., 2021].


Conclusion

In this blog post, we rethink ValueDice and present new conclusions in both the offline and online settings. Even though we focus on ValueDice, our work has the following implications for general imitation learning research.

  • Algorithm Evaluations: Our experiment results show a clear boundary between the complete-trajectory and subsampled-trajectory cases. The two cases have dramatically different characteristics, results, and explanations. As a result, we must be cautious when evaluating algorithms and clearly identify the evaluation context. Without this, some arguments may be misleading for future studies.

  • Benchmarks: Our study points out several drawbacks of existing MuJoCo benchmarks: deterministic transitions and limited initial states. For one thing, provided with 1 complete expert trajectory, the simple BC algorithm is competent on many tasks; that is, the current tasks are not even ‘‘hard’’. For another, in addition to the large-scale studies in [Hussenot et al., 2021, Orsini et al., 2021] that mainly focus on existing MuJoCo benchmarks, future imitation learning studies could benefit from more challenging benchmarks with stochastic transitions and diverse initial states.



References

  • P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the 21st International Conference on Machine Learning, pages 1–8, 2004.
  • N. Agarwal, S. Chaudhuri, P. Jain, D. Nagaraj, and P. Netrapalli. Online target q-learning with reverse experience replay: Efficiently finding the optimal policy for linear mdps. arXiv, 2110.08440, 2021.
  • B. D. Argall, S. Chernova, M. Veloso, and B. Browning. A survey of robot learning from demonstration. Robotics and autonomous systems, 57(5):469–483, 2009.
  • L. Baird. Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Machine Learning, pages 30–37. 1995.
  • D. Bertsekas. Dynamic Programming and Optimal Control: Volume I. Athena scientific, 2012.
  • A. Brock, J. Donahue, and K. Simonyan. Large scale GAN training for high fidelity natural image synthesis. arXiv, 1809.11096, 2018.
  • X. Chen, S. Li, H. Li, S. Jiang, Y. Qi, and L. Song. Generative adversarial user model for reinforcement learning based recommendation system. In Proceedings of the 36th International Conference on Machine Learning, pages 1052–1061, 2019.
  • C. Finn, S. Levine, and P. Abbeel. Guided cost learning: deep inverse optimal control via policy optimization. In Proceedings of the 33rd International Conference on Machine Learning, pages 49–58, 2016.
  • S. K. S. Ghasemipour, R. S. Zemel, and S. Gu. A divergence minimization perspective on imitation learning methods. In Proceedings of the 3rd Annual Conference on Robot Learning, pages 1259–1277, 2019.
  • J. Ho and S. Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems 29, pages 4565–4573, 2016.
  • A. Hussein, M. M. Gaber, E. Elyan, and C. Jayne. Imitation learning: A survey of learning methods. ACM Computing Surveys, 50(2):1–35, 2017.
  • L. Hussenot, M. Andrychowicz, D. Vincent, R. Dadashi, A. Raichuk, S. Ramos, N. Momchev, S. Girgin, R. Marinier, L. Stafiniak, M. Orsini, O. Bachem, M. Geist, and O. Pietquin. Hyperparameter selection for imitation learning. In Proceedings of the 38th International Conference on Machine Learning, pages 4511–4522, 2021.
  • L. Ke, M. Barnes, W. Sun, G. Lee, S. Choudhury, and S. S. Srinivasa. Imitation learning as f-divergence minimization. arXiv, 1905.12888, 2019.
  • I. Kostrikov, K. K. Agrawal, D. Dwibedi, S. Levine, and J. Tompson. Discriminator-actor-critic: Addressing sample inefficiency and reward bias in adversarial imitation learning. In Proceedings of the 7th International Conference on Learning Representations, 2019.
  • I. Kostrikov, O. Nachum, and J. Tompson. Imitation learning via off-policy distribution matching. In Proceedings of the 8th International Conference on Learning Representations, 2020.
  • A. Krogh and J. A. Hertz. A simple weight decay can improve generalization. In Advances in Neural Information Processing Systems 4, pages 950–957, 1991.
  • S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17(39):1–40, 2016.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • M. Orsini, A. Raichuk, L. Hussenot, D. Vincent, R. Dadashi, S. Girgin, M. Geist, O. Bachem, O. Pietquin, and M. Andrychowicz. What matters for adversarial imitation learning? Advances in Neural Information Processing Systems 34, 2021.
  • T. Osa, J. Pajarinen, G. Neumann, J. A. Bagnell, P. Abbeel, and J. Peters. An algorithmic perspective on imitation learning. Foundations and Trends in Robotics, 7(1-2):1–179, 2018.
  • D. Pomerleau. Efficient training of artificial neural networks for autonomous navigation. Neural Computation, 3(1):88–97, 1991.
  • L. Prechelt. Early stopping-but when? In Neural Networks: Tricks of the Trade, volume 1524 of Lecture Notes in Computer Science, pages 55–69. 1996.
  • M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
  • N. Rajaraman, L. F. Yang, J. Jiao, and K. Ramchandran. Toward the fundamental limits of imitation learning. In Advances in Neural Information Processing Systems 33, pages 2914–2924, 2020.
  • N. Rajaraman, Y. Han, L. Yang, J. Liu, J. Jiao, and K. Ramchandran. On the value of interaction and function approximation in imitation learning. Advances in Neural Information Processing Systems 34, 2021a.
  • N. Rajaraman, Y. Han, L. F. Yang, K. Ramchandran, and J. Jiao. Provably breaking the quadratic error compounding barrier in imitation learning, optimally. arXiv, 2102.12948, 2021b.
  • S. Ross and D. Bagnell. Efficient reductions for imitation learning. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, pages 661–668, 2010.
  • S. Ross, G. J. Gordon, and D. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, pages 627–635, 2011.
  • J. Shi, Y. Yu, Q. Da, S. Chen, and A. Zeng. Virtual-taobao: virtualizing real-world online retail environment for reinforcement learning. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, pages 4902–4909, 2019.
  • D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
  • N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
  • M. Sun, A. Mahajan, K. Hofmann, and S. Whiteson. Softdice for imitation learning: Rethinking off-policy distribution matching. arXiv, 2106.03155, 2021.
  • U. Syed and R. E. Schapire. A game-theoretic approach to apprenticeship learning. In Advances in Neural Information Processing Systems 20, pages 1449–1456, 2007.
  • T. Xu, Z. Li, and Y. Yu. Error bounds of imitating policies and environments. In Advances in Neural Information Processing Systems 33, pages 15737–15749, 2020.
  • T. Xu, Z. Li, and Y. Yu. Error bounds of imitating policies and environments for reinforcement learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021a.
  • T. Xu, Z. Li, and Y. Yu. More efficient adversarial imitation learning algorithms with known and unknown transitions. arXiv, 2106.10424, v2, 2021b.
  • S. Zhang, H. Yao, and S. Whiteson. Breaking the deadly triad with a target network. In Proceedings of the 38th International Conference on Machine Learning, pages 12621–12631, 2021.



Footnotes

Footnote 1: As commented by one reviewer of ValueDice [Kostrikov et al., 2020]: “if you can empirically show that ValueDice can effectively learn expert policy in off-policy setting, this will be extremely useful in many real-world tasks where we cannot collect transition data directly for exploration”. Please refer to https://openreview.net/forum?id=Hyg-JC4FDr .


Footnote 2: Our implementation of BC is different from the one in [Kostrikov et al., 2020]. Unlike the implementation in [Kostrikov et al., 2020], our BC policy does not learn the covariance. Our implementation is more natural and is widely used in previous literature [Ho and Ermon, 2016, Kostrikov et al., 2019].


Footnote 3: Note that the adversarial formulation provides a specific optimization procedure for the general state-action distribution matching objective. When we care about the sample complexity, it is enough to focus on the underlying state-action distribution matching objective.


Footnote 4: In contrast, the KL-divergence used in ValueDice is not well defined for state-action pairs $(s, a)$ with $d^{\operatorname{exp}}(s, a) = 0$ but $d^{\pi}(s, a) > 0$.



Experiment Details

Algorithm Implementation: Our implementation of ValueDice and DAC follows the public repository https://github.com/google-research/google-research/tree/master/value_dice by Kostrikov et al. [2020]. Our implementation of BC is different from the one in this repository. In particular, Kostrikov et al. [2020] implemented BC with a Gaussian policy with trainable mean and covariance as in [Kostrikov et al., 2020, Figure 4]. However, we observe that the performance of this implementation is very poor because the mean and covariance share the same hidden layers and the covariance affects the log-likelihood estimation. Instead, we use a simple MLP architecture without a covariance output. This deterministic policy is trained with the mean-square-error (MSE) loss: \begin{align*} \min_{\theta} \sum_{(s, a) \in \mathcal{D}} (f_{\theta}(s) - a)^2. \end{align*} The hidden layer sizes and optimizer of our BC policy follow the configuration of ValueDice.

Benchmarks: All preprocessing procedures follow [Kostrikov et al., 2020]. The subsampling procedure follows [Kostrikov et al., 2019]; please refer to https://github.com/google-research/google-research/blob/master/dac/replay_buffer.py#L154-L177. The expert demonstrations are from https://github.com/google-research/google-research/tree/master/value_dice#install-dependencies.

Experiments: All algorithms are run with 5 random seeds (2021-2025). For all plots, solid lines correspond to the mean, and shaded regions correspond to the standard deviation.
