Exploration by Random Network Distillation (Burda et al., 2018)

Abstract

Exploration by Random Network Distillation by Burda et al. proposed a new exploration algorithm that exceeded average human performance on Montezuma’s Revenge. This post explains the Random Network Distillation (RND) algorithm in detail, with external resources to help readers understand how the paper fits into the broader field.

Accompanying Resources

  • Proximal Policy Optimization Algorithms (Schulman et al., 2017) [Arxiv]
  • Curiosity-driven Exploration by Self-supervised Prediction (Pathak et al., 2017) [Arxiv]
  • Randomized Prior Functions for Deep Reinforcement Learning (Osband et al., 2018) [Arxiv]
  • Large-Scale Study of Curiosity-Driven Learning (Burda et al., 2018) [Arxiv]
  • Implementation of RND in the ICLR2019 submission [Google Drive]

1 Introduction

Reinforcement Learning (RL) works well when the reward function is dense and easy to find.

  • Dense: A lot of rewards are nonzero.
  • Easy to find: A random agent finds nonzero rewards.

However, reinforcement learning algorithms fail when the rewards are sparse and hard to find. One solution would be to hand-engineer dense reward functions, but this is often impractical or impossible. Another solution is to develop more sophisticated exploration methods. Exploration has been a popular research topic, and many new methods have achieved better results on hard exploration games.

  • Count-based: Unifying Count-Based Exploration and Intrinsic Motivation; Count-Based Exploration with Neural Density Models
  • Curiosity: Curiosity-driven Exploration by Self-supervised Prediction; Large-Scale Study of Curiosity-Driven Learning

However, these exploration methods are difficult to scale up: due to their complexity, it is difficult to deploy them in parallel environments. This is a crucial problem since recent state-of-the-art methods rely on using parallel environments to collect a large number of samples. The authors propose an approach called Random Network Distillation (hereafter RND) that is simpler to implement, works with high-dimensional observations, can be incorporated with policy optimization algorithms, and is efficient.

RND is tested on a few selected environments from Atari 2600 games, a standard benchmark for deep reinforcement learning algorithms. As RND is an exploration algorithm, the authors test RND on hard exploration games with sparse rewards: Freeway, Gravitar, Montezuma’s Revenge, Pitfall!, Private Eye, Solaris, and Venture.

A rough taxonomy of Atari Environments by their exploration difficulties. From Count-Based Exploration with Neural Density Models (Ostrovski et al., 2017)

Combined with Proximal Policy Optimization (PPO), RND achieved state-of-the-art performance on Montezuma’s Revenge at the time of publication, often finding 22 of the 24 rooms on the first level and often solving the first level, without using demonstrations or having access to the underlying state of the game.

A demo of RND passing the first level of Montezuma's Revenge. By OpenAI.

2 Method

2.1 Exploration Bonuses

Exploration bonuses are a class of methods that encourages exploration even when the reward $e_t$ is sparse. This is done by augmenting $e_t$ to create a new reward $r_t = e_t + i_t$, where $i_t$ is the exploration bonus associated with the transition at time $t$. The reward given by the environment is often called the extrinsic reward, and the additional reward is called the intrinsic reward.

Check Section 4.1 for more information about different exploration algorithms.

2.2 Random Network Distillation

Random Network Distillation (RND) is a prediction-error-based exploration method whose prediction problem depends only on the current state. RND uses two networks: a target network $f$ and a predictor network $\hat{f}$. The target network is fixed after random initialization and serves as the target of the prediction problem. The predictor network is trained on the data collected by the agent to solve this prediction problem. In other words, with the data collected by the agent, the predictor network $\hat{f}$ is trained via gradient descent to minimize the mean squared error (MSE):

\[|| \hat{f}(x;\theta) - f(x) ||^2\]

This training process distills a randomly initialized (target) network into a trained (predictor) network.
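As a concrete illustration, below is a minimal sketch of the RND bonus in PyTorch. The network sizes, observation dimension, and learning rate are placeholders of my own choosing rather than the paper's architecture, and the paper's observation and reward normalization are omitted.

```python
import torch
import torch.nn as nn

def make_net(obs_dim: int, out_dim: int) -> nn.Module:
    # Same architecture for target and predictor, so the target is representable.
    return nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, out_dim))

obs_dim, out_dim = 64, 32                # placeholder sizes
target = make_net(obs_dim, out_dim)      # f: fixed after random initialization
predictor = make_net(obs_dim, out_dim)   # f_hat: trained on collected data
for p in target.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.Adam(predictor.parameters(), lr=1e-4)

def intrinsic_reward(obs: torch.Tensor) -> torch.Tensor:
    # i_t = || f_hat(x) - f(x) ||^2, one value per observation in the batch
    with torch.no_grad():
        return (predictor(obs) - target(obs)).pow(2).sum(dim=-1)

def update_predictor(obs: torch.Tensor) -> float:
    # Distill the random target network into the predictor on agent-collected data.
    loss = (predictor(obs) - target(obs)).pow(2).sum(dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```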

As with the Intrinsic Curiosity Module (ICM; prior work by Pathak et al., 2017), the prediction error is low on states that are similar to states already visited. In contrast, the prediction error is higher for novel states that differ from the states the predictor network has been trained on. Thus, the intrinsic reward $i_t$ is defined as the MSE between the outputs of $f$ and $\hat{f}$.

A diagram illustrating the RND algorithm.

To test whether the prediction error between target and predictor networks can detect novelty, the authors run a toy experiment on MNIST. The predictor network is trained on a mixed dataset of images from two classes: the 0 class and a target class (e.g., the digit 1). The 0 class represents states that have been seen many times before, and the target class represents novel states. Varying the proportion of target-class images while keeping the total amount of training data constant, the experiments show that the test error on the target class decreases as more target-class data becomes available.

Novelty detection on MNIST. From Figure 2 of this paper.

In this MNIST experiment, the MSE loss never reaches 0, meaning the predictor network cannot mimic the target random network perfectly. This is desirable, as it implies that “standard gradient-based methods do not overgeneralize” to the point where the intrinsic reward collapses to 0.
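Below is a rough sketch of this kind of novelty-detection experiment, using scikit-learn's small digits dataset as a stand-in for MNIST. The network sizes, training length, and proportions are my own illustrative choices, and unlike the paper the total amount of training data is not held constant.

```python
import torch
import torch.nn as nn
from sklearn.datasets import load_digits

digits = load_digits()                                    # 8x8 digit images, values 0..16
X = torch.tensor(digits.data / 16.0, dtype=torch.float32)
y = torch.tensor(digits.target)

def make_net() -> nn.Module:
    return nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 32))

def distillation_test_error(target_frac: float, target_class: int = 1,
                            steps: int = 500) -> float:
    """Train a predictor against a frozen random target on class-0 images plus a
    fraction of target-class images; report MSE on held-out target-class images."""
    zeros = X[y == 0]                                     # "frequently seen" states
    novel = X[y == target_class]                          # "novel" states
    half = len(novel) // 2
    train = torch.cat([zeros, novel[: int(target_frac * half)]])
    test = novel[half:]                                   # held out from training

    target_net, predictor = make_net(), make_net()
    for p in target_net.parameters():
        p.requires_grad_(False)
    opt = torch.optim.Adam(predictor.parameters(), lr=1e-3)
    for _ in range(steps):
        loss = (predictor(train) - target_net(train)).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return (predictor(test) - target_net(test)).pow(2).mean().item()

for frac in [0.0, 0.25, 0.5, 1.0]:
    print(frac, distillation_test_error(frac))            # error should shrink as frac grows
```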

Empirically, in Montezuma’s Revenge, the spikes in the intrinsic reward (or the prediction error) correspond to meaningful events: losing a life (2, 8, 10, 21), escaping an enemy by a narrow margin (3, 5, 6, 11, 12, 13, 14, 15), passing a difficult obstacle (7, 9, 18), or picking up an object (20, 21).

RND exploration bonus over an episode where the agent first successfully picks up the torch. From Figure 1 of this paper.

2.2.1 Sources of Prediction Errors

Generally, in deep learning, prediction error can be attributed to four factors:

  1. Amount of training data: Prediction error is high where the predictor has seen few similar examples.
  2. Stochasticity: Prediction error is high because the prediction target is stochastic.
  3. Model misspecification: Prediction error is high because the information necessary for prediction is missing, or because the predictor’s model is too limited to model the complexity of the prediction target.
  4. Learning dynamics: Prediction error is high due to failing to find the best approximation of the prediction target in the optimization process.

Factor 1 is a useful source of error since it validates the use of RND. However, other sources of prediction errors can create undesirable effects in prediction-based exploration methods.

The most famous example is the noisy-TV problem, related to factor 2. Consider a maze environment with visual input. In this deterministic environment, maximizing prediction error is beneficial, since it rewards exploring unvisited areas. Now, suppose a noisy TV is attached to a wall inside the maze. Whenever the agent looks at the TV, it receives a high intrinsic reward because the TV's output is unpredictable, so the agent can get stuck watching the TV instead of exploring the maze.

The noisy-TV problem where an agent is stuck watching a noisy TV. From Reinforcement Learning with Prediction-Based Rewards by OpenAI.

Although this example might seem contrived, prediction-based exploration has been shown to be attracted to the inherent stochasticity of real environments, including Montezuma’s Revenge.

The noisy-TV problem in Montezuma's Revenge. Agent abuses changing rooms to gain high prediction errors. From Reinforcement Learning with Prediction-Based Rewards by OpenAI.

Previous methods tried to avoid these factors by using the relative improvement of the prediction error $\Delta E$ rather than the absolute error $E$. Sadly, this is difficult to implement efficiently. In contrast, RND obviates both factors 2 and 3: the target network is fixed after initialization, so the prediction target is deterministic rather than stochastic, and the predictor network shares the target network's architecture, so the target lies within the predictor's model class.

2.2.2 Relation to Uncertainty Quantification

It is possible to see the prediction error of RND as a quantification of uncertainty.

Consider a regression problem with the data distribution $D = \{(x_i, y_i)\}_i$. In the Bayesian setting, we would consider a prior $p(\theta^*)$ over the parameters of a mapping $f_{\theta^*}$, then calculate the posterior after updating on the evidence.

The authors follow Lemma 3 of Osband et al. (2018).

From Randomized Prior Functions for Deep Reinforcement Learning (Osband et al., 2018)

Let $\mathcal{F}$ be the distribution over functions $g_\theta = f_\theta + f_{\theta^*}$ (the ensemble), where $\theta^*$ is drawn from the prior $p(\theta^*)$ and $\theta$ is given by minimizing the expected prediction error

\[\theta = \text{argmin}_\theta \, \mathbb{E}_{(x_i, y_i) \sim D} \, || f_\theta(x_i) + f_{\theta^*}(x_i) - y_i ||^2 + \mathcal{R}(\theta)\]

where $\mathcal{R}(\theta)$ is a regularization term shown at the end of equations (4) and (5) of Lemma 3.

Now, let us confine the regression problem to predicting the constant zero function $y_i = 0$.

\[\theta = \text{argmin}_\theta \, \mathbb{E}_{(x_i, y_i) \sim D} \, || f_\theta(x_i) + f_{\theta^*}(x_i) ||^2 + \mathcal{R}(\theta)\]

Then, the optimization problem is equivalent to distilling a randomly drawn function from the prior. With $f_{\theta^*}$ being the target and $f_\theta$ the predictor, the distillation error can be seen as a quantification of uncertainty in predicting the constant zero function $y_i = 0$.

2.3 Combining Intrinsic and Extrinsic Returns

Intrinsic Reward and Non-episodic Environment

When using only the intrinsic reward, the authors explore treating the problem as non-episodic. In other words, the return is not truncated at “game over.” There are several justifications for this. First, it tells the agent that its intrinsic return should account for all the novel states it could find in future episodes, not just the current one. Also, using episodic intrinsic rewards can leak information about the task to the agent, so the setting would no longer be purely intrinsic (Burda et al., 2018).

From Large-Scale Study of Curiosity-Driven Learning (Burda et al., 2018)

The authors argue that this approach is also closer to how humans explore games. Suppose Alice is attempting a tricky part of the game where it is easy to fail. If she succeeds, her curiosity is satisfied, so the reward is high. If she fails, the true cost is only having to replay the “boring,” already-explored beginning of the game. However, if Alice is modeled as an episodic agent, her future return at game over is exactly 0 by definition, which can make her overly risk-averse: the episodic setting overstates the cost of a game over, which in reality is just the “boredom” of starting over.

For empirical results, check Section 3.1.

Extrinsic Reward and Episodic Environment

However, when we use extrinsic rewards, we should use the episodic problem setting. If we use non-episodic returns, the agent could find a strategy to exploit this setting by finding an extrinsic reward close to the beginning of the game and deliberately dying quickly. This can be seen as reward farming, a common phenomenon when the reward function is designed inappropriately.

Agent exploiting Blades of Vengeance. From Gym Retro by OpenAI.

Combining Intrinsic and Extrinsic Reward

Intrinsic rewards benefit from a non-episodic setting, while extrinsic rewards benefit from an episodic setting. We want a dense reward signal, so we want to use both intrinsic and extrinsic rewards, but it is nontrivial to estimate the combined return from two streams of rewards.

The authors solve this by fitting two value heads $V_E$ and $V_I$ separately to their respective returns. $V_E$ estimates the cumulative extrinsic reward, while $V_I$ estimates the cumulative intrinsic reward. These two value heads are added to get the value function $V = V_E + V_I$.

Fitting two value heads has an additional benefit: the extrinsic reward function is stationary, while the intrinsic reward function is non-stationary. A single value function $V$ would have to track the combined, non-stationary reward. With two value heads, $V_E$ only has to fit the stationary extrinsic reward.

For empirical results, check Section 3.2.

Separate Value Functions

The section above discussed fitting two value heads in the context of combining two reward streams with different problem settings (episodic and non-episodic). The same idea can also be used to combine reward streams with different discount factors $\gamma$.

For empirical results, check Section 3.3.
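To make the two-stream setup concrete, here is a minimal sketch of how the returns might be computed for a single rollout. It uses plain discounted returns instead of the paper's GAE and omits bootstrapping at the rollout boundary; the discount values match Section 3.3, but everything else is a simplification of my own.

```python
import numpy as np

def discounted_returns(rewards: np.ndarray, dones: np.ndarray,
                       gamma: float, episodic: bool) -> np.ndarray:
    """Backward pass over a rollout of length T (no bootstrapping, for brevity)."""
    returns = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        if episodic and dones[t]:
            running = 0.0                 # truncate the return at "game over"
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Illustrative rollout: sparse extrinsic rewards e_t, intrinsic rewards i_t, done flags.
T = 128
rng = np.random.default_rng(0)
e = rng.random(T) * (rng.random(T) < 0.02)
i = rng.random(T) * 0.1
dones = rng.random(T) < 0.01

R_E = discounted_returns(e, dones, gamma=0.999, episodic=True)   # episodic extrinsic
R_I = discounted_returns(i, dones, gamma=0.99, episodic=False)   # non-episodic intrinsic

# The two value heads V_E and V_I are each regressed onto their own return stream,
# and the policy is trained with the combined value estimate V = V_E + V_I
# (equivalently, the sum of the two advantage estimates).
```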

3 Experiments

Most of the experiments in this paper are run on Montezuma’s Revenge, widely considered the hardest Atari environment for agents to explore without access to expert demonstrations or the underlying emulator state. Two metrics are used: mean episodic return and mean number of rooms found.

3.1 Pure Exploration

In this section, the authors test the hypothesis in Section 2.3 that the non-episodic setting is a more natural setting when only the intrinsic reward $i_t$ is used.

Mean episodic return and mean number of rooms on Montezuma's Revenge when trained without extrinsic reward. From Figure 3 of this paper.

Since the agent is using the intrinsic reward only, it is not directly optimizing for either metric. However, to get a high intrinsic reward, the agent needs to find novel states, including finding the key and opening the room with that key. Thus, the agent shows improvement over time in both metrics.

Note that the mean episodic return is somewhat inconsistent: it decreases slowly after increasing sharply for the first 0.4 billion frames. This is because once the agent has learned how to use an item or reach a room, doing so is no longer interesting, and the intrinsic reward for such actions becomes low. Therefore, even though the agent reaches more and more rooms, it receives less and less intrinsic reward.

3.2 Combining Episodic and Non-episodic Returns

In the experiment above, episodic and non-episodic settings were compared with agents only trained on intrinsic rewards. A natural experiment to follow would be to compare these settings again using both intrinsic and extrinsic rewards. Extrinsic rewards are fixed to be episodic, and both episodic and non-episodic settings are tested for intrinsic rewards. If both rewards are episodic, it is possible to use a single value head, which the authors also experiment with.

Mean episodic return and mean number of rooms on Montezuma's Revenge for different combination strategies of intrinsic and extrinsic rewards using CNN policy. From Figure 6 (b) of this paper.

Contrary to the authors’ expectations, using two value heads with non-episodic intrinsic rewards and episodic extrinsic rewards did not show any benefit over the other combinations. Nevertheless, the remaining experiments still use two value heads with non-episodic intrinsic rewards.

Similar experiments are performed with RNN policies, but they consistently have worse performance than CNN policies. Check Section 3.4 below for more details.

3.3 Discount Factors

Previous state-of-the-art results on Montezuma’s Revenge reported better performance with higher discount factors, since they allow the agent to look further into the future. A standard discount factor is 0.99, but higher values have worked better for algorithms that can handle the increased variance that comes with a longer effective horizon. The discount factor is therefore an important hyperparameter to tune, as shown below.

Effect of higher discount factor (> 0.99). Figure 3 (b) from Expert-augmented actor-critic for ViZDoom and Montezuma’s Revenge (Garmulewicz, Michalewski, and Miłoś, 2018).

Following these previous works highlighting the importance of high discount factors, the authors compare different values of $\gamma_I, \gamma_E$.

Mean episodic return and mean number of rooms on Montezuma's Revenge for different discount factors. From Figure 4 of this paper.

We see that $\gamma_I = 0.99$ and $\gamma_E = 0.999$ yield the best result, with a mean return of 11.5K.

3.4 Recurrence

Montezuma’s Revenge is a partially observable environment. The observation only includes information about the current room and the number of keys the player has. From the observation, the agent cannot deduce where the keys came from, how many were used, or which doors are open.

To deal with this partial observability, it is possible to reformulate a state as a summary of the past using a recurrent neural network (RNN). This is a similar approach to deep recurrent Q-network (DRQN).

The DRQN architecture. From Deep Q-Learning with Recurrent Neural Networks (Chen et al., 2015)

To distinguish the two, the new formulation is called the RNN policy, and the original formulation using only the current visual observation is called the CNN policy.

Comparison of mean episodic return and mean number of rooms on Montezuma's Revenge for RNN and CNN policies. From Figure 6 of this paper.

To the authors’ surprise, the RNN policy performs worse than the CNN policy.

3.5 Scaling Up RNN Training

In this section, the authors further investigate RNN policies to show the effect of the increased scale of parallel environments. For all experiments in this section, intrinsic rewards are non-episodic, and $\gamma_I = 0.99, \gamma_E = 0.999$.

Agents are tested with 32, 128, 256, and 1024 parallel environments. For a fair comparison, the effective batch size used to train the predictor network must be kept fixed: a larger batch makes the predictor learn faster, which makes the intrinsic reward decay more rapidly. Thus, when the number of environments is increased from 32 to 128 (4 times), 75% of the batch elements are randomly dropped, keeping just 25%. Similarly, when scaling up from 32 to 256 and 1024 environments, only 12.5% and 3.125% of the batch is kept.
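Here is a sketch of this proportional dropping; the 32-environment reference point comes from the paper, while the masking code itself is just one plausible way to implement it.

```python
import torch

def predictor_keep_mask(batch_size: int, num_envs: int,
                        reference_envs: int = 32) -> torch.Tensor:
    """Keep each batch element with probability reference_envs / num_envs, so the
    predictor sees roughly the same amount of data per update regardless of how
    many parallel environments feed the batch."""
    keep_prob = min(1.0, reference_envs / num_envs)
    return torch.rand(batch_size) < keep_prob

# e.g. with 1024 environments only ~3.125% of the batch trains the predictor
mask = predictor_keep_mask(batch_size=4096, num_envs=1024).float()
# predictor_loss = (per_element_mse * mask).sum() / mask.sum().clamp(min=1.0)
```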

Performance of RNN agents with a different number of parallel environments. From Figure 5 of this paper.

As predicted, the agent performs better with more parallel environments. With 1024 environments, the RNN RND agent had a mean episodic return of 10070 with the best return of 14415.

Separately, the authors allowed the RNN RND agent with 32 environments to train for 1.6M parameter updates (1.6B frames). This agent had a mean episodic return of 7570, and the best run was able to achieve a return of 17500, visiting all 24 rooms and completing the first level.

Mean of RNN RND agents with 32 parallel environments. From Reinforcement Learning with Prediction-Based Rewards by OpenAI.

3.6 Comparison to Baselines

To compare RND with two baselines, it is also trained on six hard-exploration Atari 2600 games: Gravitar, Montezuma’s Revenge, Pitfall!, Private Eye, Solaris, and Venture.

The first baseline is the “vanilla” Proximal Policy Optimization (PPO) agent, without any exploration bonus.

The second baseline is PPO with a different exploration bonus mechanism based on forward dynamics error. There are numerous works on designing intrinsic rewards with forward dynamics, as described in Section 4.1. Among those, the authors select the Intrinsic Curiosity Module (ICM). It is a good representative of prior methods using forward dynamics error.

Furthermore, Burda et al. (2018) showed that training a forward dynamics model in a random feature (RF) space works as well as any other feature space most of the time, so the authors use the RF space instead (ICM-RF). RND and ICM-RF are quite similar, allowing for a direct comparison of algorithms while fixing other parts of the methods such as dual value heads, non-episodic intrinsic returns, normalization schemes, etc.

Different feature spaces for training a forward dynamics model. Figure 2 from Large-Scale Study of Curiosity-Driven Learning (Burda et al., 2018)
Performances of PPO, RND, and ICM-RF (labeled CNN policy, dynamics). Figure 7 from this paper.
Performances of PPO, RND, and ICM-RF (labeled DYN CNN). Table 5 from this paper.

RND achieves new state-of-the-art for Gravitar and Montezuma’s Revenge and competes for the state-of-the-art in Venture. RND gets a sub-state-of-the-art score on Private Eye and Solaris but is better than PPO and ICM-RF. Like all other methods, RND fails to get a positive score for Pitfall.

3.7 Qualitative Analysis: Dancing with Skulls

Observing the RND agent, the authors found that once it obtains all the extrinsic rewards it knows how to obtain reliably, it continues to interact with potentially dangerous objects. For instance, in Montezuma’s Revenge, the agent jumps back and forth over a moving skull that makes the agent lose a life upon contact. Similarly, in Pitfall!, the agent repeatedly “dances” with the rope and the scorpion.

The authors speculate that the agent adopts this behavior because such dangerous states are difficult to reach and survive, and they are therefore rarely represented in the agent’s past experience compared to safer states.

The videos can be found in the Google Drive folder shared by the authors.

4 Related Work

4.1 Exploration

To encourage exploration, the intrinsic reward $i_t$ should be designed so that it is higher in novel states than in frequently visited states. If the environment were simple enough that states and their visitation counts could be represented in a table, we could simply tally the number of visits to each state; for a 5x5 grid, we would only need to keep track of 25 numbers. In such tabular cases, we can define $i_t$ as a decreasing function of the visitation count $n(s)$. These are called count-based exploration methods:

\[i_t = \frac{\beta}{n(s)} \quad \text{or} \quad \frac{\beta}{\sqrt{n(s)}}\]

where $\beta$ is a coefficient that scales the exploration bonus (see the sketch below). However, most interesting environments are much more complex. For example, if the state space were the real line and an agent starting at a random point could move left or right by any distance, most states would be visited at most once. In such non-tabular cases, it is difficult to define a visitation count. A possible generalization is the pseudo-count $N(s)$, derived from a learned density model over states, which takes the place of the visitation count in the bonus. With density estimates, even a state that has never been visited can have a positive pseudo-count if it is similar to previously visited states.
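A minimal tabular sketch of such a count-based bonus (the grid-world states and the value of $\beta$ are illustrative):

```python
import math
from collections import defaultdict

class CountBonus:
    """Tabular count-based exploration bonus i_t = beta / sqrt(n(s))."""
    def __init__(self, beta: float = 0.1):
        self.beta = beta
        self.counts = defaultdict(int)

    def bonus(self, state) -> float:
        self.counts[state] += 1
        return self.beta / math.sqrt(self.counts[state])

# Usage on a 5x5 grid world where states are (row, col) tuples:
explorer = CountBonus(beta=0.1)
print(explorer.bonus((2, 3)))   # first visit  -> beta / sqrt(1)
print(explorer.bonus((2, 3)))   # second visit -> beta / sqrt(2)
```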

Another way to design the intrinsic reward $i_t$ is to define it as the prediction error for some problem related to the agent’s transitions. Dynamics prediction methods predict the environment dynamics and use the prediction error as the exploration bonus. Using the raw prediction error makes the agent susceptible to the “noisy-TV” problem in stochastic or partially observable environments, so alternative signals, such as the improvement of the prediction error, have also been used.

Curiosity as proposed by Schmidhuber in 1991. From A Possibility for Implementing Curiosity and Boredom in Model-Building Neural Controllers (Schmidhuber, 1991)

The most relevant example is the Intrinsic Curiosity Module (Pathak et al., 2017; Burda et al., 2018). The Intrinsic Curiosity Module (ICM) trains a forward model that, given the encoded state $\phi(s_t)$ and the action $a_t$, outputs a prediction $\hat{\phi}(s_{t+1})$ of the encoded next state $\phi(s_{t+1})$. The intrinsic reward $i_t$ is defined as the prediction error of the forward model, which is trained as the agent explores the environment. Thus, a low prediction error means the ICM has already learned the transition from $(s_t, a_t)$.

Diagram illustrating the ICM algorithm. From Curiosity-driven Exploration by Self-supervised Prediction (Pathak et al., 2017)
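Below is a minimal sketch of the forward-model part of this idea in PyTorch. The encoder here is a frozen random MLP, which corresponds to the random-features (RF) variant used as a baseline in Section 3.6; full ICM instead learns $\phi$ jointly with an inverse model, which is omitted. All sizes are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ForwardDynamicsBonus(nn.Module):
    """Intrinsic reward = error of predicting phi(s_{t+1}) from phi(s_t) and a_t."""
    def __init__(self, obs_dim: int, n_actions: int, feat_dim: int = 32):
        super().__init__()
        self.n_actions = n_actions
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                     nn.Linear(64, feat_dim))
        for p in self.encoder.parameters():   # frozen random features (RF variant)
            p.requires_grad_(False)
        self.forward_model = nn.Sequential(nn.Linear(feat_dim + n_actions, 64), nn.ReLU(),
                                           nn.Linear(64, feat_dim))

    def forward(self, obs: torch.Tensor, action: torch.Tensor,
                next_obs: torch.Tensor) -> torch.Tensor:
        phi, phi_next = self.encoder(obs), self.encoder(next_obs)
        a = F.one_hot(action, self.n_actions).float()
        phi_pred = self.forward_model(torch.cat([phi, a], dim=-1))
        # The per-sample prediction error doubles as the intrinsic reward and,
        # averaged over a batch, as the forward model's training loss.
        return (phi_pred - phi_next).pow(2).sum(dim=-1)
```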

Other exploration methods include adversarial self-play, empowerment maximization, parameter noise injection, option discovery, and ensembles.

4.2 Montezuma’s Revenge

Commonly known as one of the hardest problems of Atari 2600 since the birth of Deep Q-Networks (DQN; Mnih et al., 2015), Montezuma’s Revenge has been a standard benchmark for exploration algorithms.

Without any explicit exploration bonus, early deep reinforcement learning algorithms such as DQN failed to make meaningful progress. However, in 2018, Ape-X (Horgan et al., 2018), IMPALA (Espeholt et al., 2018), and Self-Imitation Learning (SIL; Oh et al., 2018) showed that even without such a bonus, it is possible to achieve a score of 2500.

Using the pseudo-count exploration bonus discussed above enabled new state-of-the-art performance at the time, as shown by DQN-CTS (Bellemare et al., 2016) and DQN-PixelCNN (Ostrovski et al., 2017).

Some works have also improved exploration by using the emulator’s internal RAM state to hand-craft exploration bonuses. Even with this privileged access, these methods still scored below the average human.

Expert demonstrations have also been used to simplify the exploration problem, and multiple methods such as atari-reset achieved superhuman performance this way. However, learning from expert demonstrations exploits the deterministic nature of the environment. To prevent the agent from simply memorizing the expert’s sequence of actions, newer methods are evaluated on a stochastic variant of the environment with sticky actions, where at each step the previous action is repeated with some probability (see the sketch below).
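A sketch of sticky actions as a Gym-style wrapper; the repeat probability of 0.25 is a commonly used value in the literature, and the wrapper itself is illustrative rather than the exact evaluation code.

```python
import random
import gym

class StickyActions(gym.Wrapper):
    """With probability p, ignore the chosen action and repeat the previous one."""
    def __init__(self, env: gym.Env, p: float = 0.25):
        super().__init__(env)
        self.p = p
        self.last_action = 0

    def reset(self, **kwargs):
        self.last_action = 0
        return self.env.reset(**kwargs)

    def step(self, action):
        if random.random() < self.p:
            action = self.last_action
        self.last_action = action
        return self.env.step(action)
```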

4.3 Random Features

Using the features of a randomly initialized neural network has been extensively studied in the context of supervised learning. It has also recently been used in reinforcement learning as an exploration technique by Osband et al. (2018) and Burda et al. (2018). This work was motivated by Osband et al. as shown in Section 2.2, as the authors use a lemma from this work. The work by Burda et al. was used as a baseline in Section 3.6.

4.4 Vectorized Value Functions

The idea of a vectorized value function was also used in Temporal Difference Models (TDM; Pong et al., 2018) and C51 (Bellemare et al., 2017).

5 Discussion

RND was able to use directed exploration to achieve high performance in Atari games despite its simplicity. This suggests that when applied at scale, even simple exploration methods can solve hard exploration games. The results also suggest that methods that can treat intrinsic and extrinsic rewards separately can benefit from such flexibility.

RND is sufficient for local exploration: exploring the consequences of short-term decisions, such as whether to interact with or avoid a particular object. However, the authors note that RND does not perform the kind of global exploration that requires coordinating decisions over long time horizons.

To understand global exploration, consider Montezuma’s Revenge. The RND agent is good at local exploration: it can choose to use or avoid the ladder, key, skull, or other objects. However, Montezuma’s Revenge requires more than local exploration. In the first level, there are four keys and six locked doors spread throughout the level. Any key can open any door, but the key is consumed in the process. To solve the first level, the agent must enter a room locked behind two doors, so it must forgo opening two of the other doors that are easier to find, even though opening them yields immediate extrinsic reward. This requires global exploration through long-term planning.

How can we convince the agent to behave this way? Since leaving the other two doors closed means forgoing extrinsic reward, the agent would need enough intrinsic reward to compensate for that loss. The authors suspect that the RND agent does not receive enough intrinsic incentive to try this strategy, and thus it rarely manages to finish the level.

Final Thoughts

Questions

  • The authors argue that RND trivializes the noisy-TV problem. However, can’t “dancing with skulls” be thought of as a variant of the noisy-TV problem?
  • In simulated environments, “dancing with skulls” is just an interesting observation. However, if the agent were deployed in the real world, we would want it to stay away from danger (for example, a robot deployed in a firefighting operation). Are there methods to discourage such behavior after training has finished? Can they coexist with exploration methods?
  • In Section 5, the authors argue that RND shows how separating intrinsic and extrinsic rewards could benefit the agent. However, a single value head seems to do just as well as two value heads (Figure 6 in Section 3.2). Does most of this benefit come from being able to tune the discount factors separately (Figure 4)?

Recommended Next Papers