<!DOCTYPE html>
<html lang="en-us">

  <head>
  <link href="http://gmpg.org/xfn/11" rel="profile">
  <meta http-equiv="content-type" content="text/html; charset=utf-8">

  <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1">

  <title>
    
      Exploration by Random Network Distillation (Burda et al., 2018) &middot; The ICLR Blog Track
    
  </title>

  
  <link rel="canonical" href="https://iclr.iro.umontreal.ca/8416f0c9-c037-40c4-81bd-1635fb22227a_1642245050/2021/12/01/exploration-by-random-network-distillation/">
  

  <link rel="stylesheet" href="https://iclr.iro.umontreal.ca/8416f0c9-c037-40c4-81bd-1635fb22227a_1642245050/public/css/poole.css">
  <link rel="stylesheet" href="https://iclr.iro.umontreal.ca/8416f0c9-c037-40c4-81bd-1635fb22227a_1642245050/public/css/syntax.css">
  <link rel="stylesheet" href="https://iclr.iro.umontreal.ca/8416f0c9-c037-40c4-81bd-1635fb22227a_1642245050/public/css/lanyon.css">
  <link rel="stylesheet" href="https://iclr.iro.umontreal.ca/8416f0c9-c037-40c4-81bd-1635fb22227a_1642245050/public/css/custom.css">
  <link rel="stylesheet" href="https://fonts.googleapis.com/css?family=PT+Serif:400,400italic,700%7CPT+Sans:400">

  <link rel="apple-touch-icon-precomposed" sizes="144x144" href="https://iclr.iro.umontreal.ca/8416f0c9-c037-40c4-81bd-1635fb22227a_1642245050/public/apple-touch-icon-precomposed.png">
  <link rel="shortcut icon" href="https://iclr.iro.umontreal.ca/8416f0c9-c037-40c4-81bd-1635fb22227a_1642245050/public/favicon.ico">

  <link rel="alternate" type="application/rss+xml" title="RSS" href="https://iclr.iro.umontreal.ca/8416f0c9-c037-40c4-81bd-1635fb22227a_1642245050/atom.xml">

  

  <script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript" ></script>
 <!-- <script type="text/x-mathjax-config"> MathJax.Hub.Config({ TeX: { equationNumbers: { autoNumber: "AMS" } } }); </script> -->
  <script type="text/x-mathjax-config">
      MathJax.Hub.Config({
        tex2jax: { inlineMath: [ ['$','$'], ["\\(","\\)"] ],
         processEscapes: false
        }
      });
</script>
</head>


  <body>

    <!-- Target for toggling the sidebar `.sidebar-checkbox` is for regular
     styles, `#sidebar-checkbox` for behavior. -->
<input type="checkbox" class="sidebar-checkbox" id="sidebar-checkbox">
<!-- <input type="checkbox" class="sidebar-checkbox" id="sidebar-checkbox" > -->

<!-- Toggleable sidebar -->
<div class="sidebar" id="sidebar">
  <div class="sidebar-item">
    <p>For short-term, peer-sourced tests of time, generalizations, specializations, reproductions, etc.!</p>
  </div>

  <nav class="sidebar-nav">

    

    
    
      
        
          <a class="sidebar-nav-item" href="https://iclr.iro.umontreal.ca/8416f0c9-c037-40c4-81bd-1635fb22227a_1642245050/">ICLR 2022 Blog Track</a>
        
      
    
      
        
      
    
      
        
          <a class="sidebar-nav-item" href="https://iclr.iro.umontreal.ca/8416f0c9-c037-40c4-81bd-1635fb22227a_1642245050/about/">About</a>
        
      
    
      
    
      
        
      
    
      
        
          <a class="sidebar-nav-item" href="https://iclr.iro.umontreal.ca/8416f0c9-c037-40c4-81bd-1635fb22227a_1642245050/submitting/">Submitting</a>
        
      
    
      
        
          <a class="sidebar-nav-item" href="https://iclr.iro.umontreal.ca/8416f0c9-c037-40c4-81bd-1635fb22227a_1642245050/tags/">Tags</a>
        
      
    

    <a class="sidebar-nav-item" href="https://github.com/iclr-blog-track/iclr-blog-track.github.io">GitHub project</a>
    <span class="sidebar-nav-item">Currently vICLR Spring 2021</span>
  </nav>

  <div class="sidebar-item">
    <p>
      &copy; 2022. All rights reserved.
    </p>
  </div>
</div>


    <!-- Wrap is the content to shift when toggling the sidebar. We wrap the
         content to avoid any CSS collisions with our real content. -->
    <div class="wrap">
      <div class="masthead">
        <div class="container">
          <h3 class="masthead-title">
            <a href="/" title="Home">The ICLR Blog Track</a>
            <small></small>
          </h3>
        </div>
      </div>

      <div class="container content">
        <div class="post">
  <h1 id="iclr-post-title" class="post-title">Exploration by Random Network Distillation (Burda et al., 2018)</h1>
  <span class="post-date">01 Dec 2021 | 
    <a class="content-tag" href="/tags/#machine-learning"> machine learning </a>
  
    <a class="content-tag" href="/tags/#deep-learning"> deep learning </a>
  
    <a class="content-tag" href="/tags/#reinforcement-learning"> reinforcement learning </a>
  
    <a class="content-tag" href="/tags/#exploration"> exploration </a>
  </span>

  <span id="iclr-post-authors" class="post-date">Anonymous</span>
  <style>
figcaption {
    margin-bottom: 1rem;
}
</style>

<p align="center">
<img src="/public/images/2021-12-01-exploration-by-random-network-distillation/front.png" alt="Abstract" />
</p>

<p><a href="https://openreview.net/forum?id=H1lJJnR5Ym">Exploration by Random Network Distillation</a> by Burda et al. proposed a new exploration algorithm that achieved superhuman performance in <em>Montezuma’s Revenge</em>. This post explains the Random Network Distillation (RND) algorithm in detail and points to external resources that help readers understand how the paper fits into the broader field.</p>

<p><strong>Accompanying Resources</strong></p>

<ul>
  <li><em>Proximal Policy Optimization Algorithms</em> (Schulman et al., 2017) <a href="https://arxiv.org/abs/1707.06347">[Arxiv]</a></li>
  <li><em>Curiosity-driven Exploration by Self-supervised Prediction</em> (Pathak et al., 2017) <a href="https://arxiv.org/abs/1705.05363">[Arxiv]</a></li>
  <li><em>Randomized Prior Functions for Deep Reinforcement Learning</em> (Osband et al., 2018) <a href="https://arxiv.org/abs/1806.03335">[Arxiv]</a></li>
  <li><em>Large-Scale Study of Curiosity-Driven Learning</em> (Burda et al., 2018) <a href="https://arxiv.org/abs/1808.04355">[Arxiv]</a></li>
  <li>Implementation of RND in the ICLR2019 submission <a href="https://goo.gl/DGPC8E">[Google Drive]</a></li>
</ul>

<h2 id="section-1">1 Introduction</h2>

<p>Reinforcement Learning (RL) works well when the reward function is dense and easy to find.</p>

<ul>
  <li><strong>Dense</strong>: A lot of rewards are nonzero.</li>
  <li><strong>Easy to find</strong>: A random agent finds nonzero rewards.</li>
</ul>

<p>However, reinforcement learning algorithms often fail when the rewards are sparse and hard to find. One solution is to hand-engineer a dense reward function, but this is often impractical or impossible. Another solution is to develop better exploration methods. Exploration has been a popular research topic, and several recent methods have improved results on hard exploration games. The table below lists representative count-based and curiosity-based methods.</p>

<table>
  <thead>
    <tr>
      <th>Count-based</th>
      <th>Curiosity</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><a href="https://arxiv.org/abs/1606.01868">Unifying Count-Based Exploration and Intrinsic Motivation</a></td>
      <td><a href="https://arxiv.org/abs/1705.05363">Curiosity-driven Exploration by Self-supervised Prediction</a></td>
    </tr>
    <tr>
      <td><a href="https://arxiv.org/abs/1703.01310">Count-Based Exploration with Neural Density Models</a></td>
      <td><a href="https://arxiv.org/abs/1808.04355">Large-Scale Study of Curiosity-Driven Learning</a></td>
    </tr>
  </tbody>
</table>

<p>However, these exploration methods are difficult to scale up: due to their complexity, it is difficult to deploy them in parallel environments. This is a crucial problem since recent state-of-the-art methods rely on using parallel environments to collect a large number of samples. The authors propose an approach called <strong>Random Network Distillation</strong> (hereafter RND) that is simpler to implement, works with high-dimensional observations, can be incorporated with policy optimization algorithms, and is efficient.</p>

<p>RND is tested on a few selected environments from <em>Atari 2600</em> games, a standard benchmark for deep reinforcement learning algorithms. As RND is an exploration algorithm, the authors test RND on six hard exploration games with sparse rewards: Gravitar, Montezuma’s Revenge, Pitfall!, Private Eye, Solaris, and Venture.</p>

<figure align="center">
  <img style="margin: 0 auto;" src="/public/images/2021-12-01-exploration-by-random-network-distillation/atari-chart.png" alt="" />
  <figcaption>A rough taxonomy of Atari Environments by their exploration difficulties. From <em>Count-Based Exploration with Neural Density Models</em> (Ostrovski et al., 2017)</figcaption>
</figure>

<p>Combined with Proximal Policy Optimization (PPO), <strong>RND achieves state-of-the-art performance in Montezuma’s Revenge</strong> (when published), often finding 22 out of 24 rooms on the first level and occasionally completing the first level, without using demonstrations or having access to the underlying state of the game.</p>

<figure align="center">
  <div class="youtube-responsive">
    <iframe width="560" height="315" src="https://www.youtube.com/embed/40VZeFppDEM" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
  </div>
  <figcaption>A demo of RND passing the first level of Montezuma's Revenge. By OpenAI.</figcaption>
</figure>

<h2 id="section-2">2 Method</h2>

<h3 id="section-2-1">2.1 Exploration Bonuses</h3>

<p><strong>Exploration bonuses</strong> are a class of methods that encourage exploration even when the environment reward $e_t$ is sparse. This is done by augmenting $e_t$ to create a new reward $r_t = e_t + i_t$, where $i_t$ is the <strong>exploration bonus</strong> associated with the transition at time $t$. The reward given by the environment is often called the <strong>extrinsic reward</strong>, and the additional reward is called the <strong>intrinsic reward</strong>.</p>

<p>Check <a href="#section-4-1">Section 4.1</a> for more information about different exploration algorithms.</p>

<h3 id="section-2-2">2.2 Random Network Distillation</h3>

<p><strong>Random Network Distillation (RND)</strong> is an exploration method whose bonus is a prediction error computed on states. RND uses two networks: a <strong>target network</strong> $f$ and a <strong>predictor network</strong> $\hat{f}$. The target network is randomly initialized and then kept fixed; its outputs define the prediction problem. The predictor network $\hat{f}$ is trained by gradient descent on the data collected by the agent to minimize the mean squared error (MSE) with respect to its parameters $\theta$:</p>

\[|| \hat{f}(x;\theta) - f(x) ||^2\]

<p>This training process <strong>distills</strong> a randomly initialized (target) network into a trained (predictor) network.</p>

<p>As with the <a href="https://arxiv.org/abs/1705.05363">Intrinsic Curiosity Module (ICM; prior work by Pathak et al., 2017)</a>, the prediction error is low on states similar to those already visited and higher on novel states that differ from the states the predictor network has been trained on. Thus, the intrinsic reward $i_t$ is defined as the squared error between the outputs of the two networks $f$ and $\hat{f}$.</p>
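<p>The idea can be sketched in a few lines of NumPy. The example below is a toy illustration and not the authors' implementation: the network sizes, the bias-free one-hidden-layer architecture, the learning rate, and the two synthetic "state" clusters are all assumptions chosen to keep the code short. It trains the predictor only on "familiar" inputs and then compares the intrinsic reward on familiar and "novel" inputs.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
IN_DIM, HIDDEN, OUT_DIM = 8, 32, 4  # illustrative sizes, not from the paper


def init_params():
    """Randomly initialise a one-hidden-layer network (no biases, for brevity)."""
    return {
        "A": rng.normal(size=(IN_DIM, HIDDEN)) / np.sqrt(IN_DIM),
        "B": rng.normal(size=(HIDDEN, OUT_DIM)) / np.sqrt(HIDDEN),
    }


def forward(params, x):
    """Return the hidden activations and the network output."""
    h = np.tanh(x @ params["A"])
    return h, h @ params["B"]


def intrinsic_reward(target, predictor, x):
    """Per-sample squared error between predictor and target outputs."""
    _, y = forward(target, x)
    _, y_hat = forward(predictor, x)
    return np.sum((y_hat - y) ** 2, axis=1)


def train_predictor(target, predictor, x, steps=500, lr=0.02):
    """Full-batch gradient descent on the distillation loss; the target stays fixed."""
    _, y = forward(target, x)
    for _ in range(steps):
        h, y_hat = forward(predictor, x)
        d = 2.0 * (y_hat - y) / len(x)  # gradient of the mean squared error w.r.t. y_hat
        grad_B = h.T @ d
        grad_A = x.T @ ((d @ predictor["B"].T) * (1.0 - h ** 2))
        predictor["A"] -= lr * grad_A
        predictor["B"] -= lr * grad_B


# Two synthetic clusters standing in for visited and unvisited states.
familiar_center = np.zeros(IN_DIM)
familiar_center[0] = 2.0
novel_center = np.zeros(IN_DIM)
novel_center[1] = 2.0
familiar = familiar_center + 0.1 * rng.normal(size=(256, IN_DIM))
novel = novel_center + 0.1 * rng.normal(size=(256, IN_DIM))

target, predictor = init_params(), init_params()
train_predictor(target, predictor, familiar)

familiar_bonus = intrinsic_reward(target, predictor, familiar).mean()
novel_bonus = intrinsic_reward(target, predictor, novel).mean()
print(f"familiar: {familiar_bonus:.4f}  novel: {novel_bonus:.4f}")
```

<p>In this toy setting the bonus on the familiar cluster should end up well below the bonus on the novel cluster, which is the behaviour the intrinsic reward relies on.</p>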

<figure align="center">
  <img style="margin: 0 auto;" src="/public/images/2021-12-01-exploration-by-random-network-distillation/diagram.png" alt="" />
  <figcaption>A diagram illustrating the RND algorithm.</figcaption>
</figure>

<p>To test the validity of detecting novelty through the prediction error between the target and predictor networks, the authors train a toy model on MNIST. The predictor network is trained on a mixed dataset of images from two classes: the 0 class and a target class (e.g., 1). The 0 class represents states that have been seen many times before, and the target class represents novel states. Varying the proportion of 0-class to target-class images while keeping the total amount of data constant, the experiments show that the test error on the target class decreases as more target-class data becomes available.</p>

<figure align="center" style="margin: 0 auto; width: 50%;">
  <img style="margin: 0 auto;" src="/public/images/2021-12-01-exploration-by-random-network-distillation/fig2.png" alt="Figure 2" />
  <figcaption>Novelty detection on MNIST. From Figure 2 of this paper.</figcaption>
</figure>

<p>In this MNIST experiment, the MSE loss never reaches 0. This means that the predictor network is not able to mimic the target random network perfectly. This is desirable, as it implies that “standard gradient-based methods do not overgeneralize” such that the intrinsic reward becomes 0.</p>

<p>Empirically, in Montezuma’s Revenge, the spikes in the intrinsic reward (or the prediction error) correspond to meaningful events: losing a life (2, 8, 10, 21), escaping an enemy by a narrow margin (3, 5, 6, 11, 12, 13, 14, 15), passing a difficult obstacle (7, 9, 18), or picking up an object (20, 21). The numbers in parentheses refer to the labeled frames in the figure below.</p>

<figure align="center" style="margin: 0 auto; width: 80%;">
  <img style="margin: 0 auto;" src="/public/images/2021-12-01-exploration-by-random-network-distillation/fig1.png" alt="Figure 1" />
  <figcaption>RND exploration bonus over an episode where the agent first successfully picks up the torch. From Figure 1 of this paper.</figcaption>
</figure>

<h3 id="section-2-2-1">2.2.1 Sources of Prediction Errors</h3>

<p>Generally, in deep learning, prediction error can be attributed to four factors:</p>

<ol>
  <li><strong>Amount of training data</strong>: Prediction error is high where the predictor has seen few similar examples and therefore fails to generalize.</li>
  <li><strong>Stochasticity</strong>: Prediction error is high because the prediction target is stochastic.</li>
  <li><strong>Model misspecification</strong>: Prediction error is high because the information necessary for prediction is missing, or because the predictor’s model is too limited to model the complexity of the prediction target.</li>
  <li><strong>Learning dynamics</strong>: Prediction error is high due to failing to find the best approximation of the prediction target in the optimization process.</li>
</ol>

<p>Factor 1 is the useful source of error: it is what allows prediction error to serve as a novelty signal, and hence as an exploration bonus. However, the other sources of prediction error can create undesirable effects in prediction-based exploration methods.</p>

<p>The best-known example is the noisy-TV problem, which stems from factor 2. Consider a maze environment with visual input. If the environment is deterministic, maximizing prediction error is beneficial, since it rewards exploring unvisited areas. Now suppose a TV showing random noise is attached to a wall inside the maze. Whenever the agent looks at the TV, it receives a high reward, because the random images can never be predicted.</p>

<figure align="center" style="margin: 0 auto; width: 60%;">
  <img style="margin: 0 auto; width: 50%" src="/public/images/2021-12-01-exploration-by-random-network-distillation/noisy-tv.gif" alt="The Noisy TV Problem" />
  <figcaption>The noisy-TV problem where an agent is stuck watching a noisy TV. From <em>Reinforcement Learning with Prediction-Based Rewards</em> by OpenAI.</figcaption>
</figure>

<p>Although this example might feel too artificial, prediction-based exploration was shown to be attracted to the inherent stochasticity of the environment. This includes Montezuma’s Revenge.</p>

<figure align="center" style="margin: 0 auto; width: 60%;">
  <img style="margin: 0 auto; width: 50%" src="/public/images/2021-12-01-exploration-by-random-network-distillation/noisy-montezuma.gif" alt="The Noisy TV Problem" />
  <figcaption>The noisy-TV problem in <em>Montezuma's Revenge</em>. Agent abuses changing rooms to gain high prediction errors. From <em>Reinforcement Learning with Prediction-Based Rewards</em> by OpenAI.</figcaption>
</figure>

<p>Previous methods tried to avoid factors 2 and 3 by using the relative improvement of the prediction error $\Delta E$ rather than the absolute error $E$. Unfortunately, this is difficult to implement efficiently. In contrast, RND obviates both factors 2 and 3. The target network is fixed, so the prediction target is deterministic rather than stochastic. Also, the target network and the predictor network have the same architecture, so the target lies within the class of functions the predictor can represent.</p>

<h3 id="section-2-2-2">2.2.2 Relation to Uncertainty Quantification</h3>

<p>It is possible to see the prediction error of RND as a <strong>quantification of uncertainty</strong>.</p>

<p>Consider a regression problem with the data distribution $D = \{x_i, y_i\} _ i$. In the Bayesian setting, we would consider a prior $p(\theta^* )$ over the parameters of a mapping $f_{\theta^*}$, then calculate the posterior after updating on the evidence.</p>

<p>The authors follow Lemma 3 of <a href="https://arxiv.org/abs/1806.03335">Osband et al. (2018)</a>.</p>

<figure align="center">
  <img style="margin: 0 auto;" src="/public/images/2021-12-01-exploration-by-random-network-distillation/lemma3.png" alt="" />
  <figcaption>From <em>Randomized Prior Functions for Deep Reinforcement Learning</em> (Osband et al., 2018)</figcaption>
</figure>

<p>Let $\mathcal{F}$ be the distribution over functions $g_\theta = f_\theta + f_{\theta^* }$ (ensemble). $\theta^ *$ is drawn from the prior $p(\theta^ *)$, and $\theta$ is given by minimizing the expected prediction error</p>

\[\theta = \text{argmin}_ \theta \mathbb{E}_{(x_i, y_i) \sim D} || f_\theta(x_i) + f_{\theta^*}(x_i)-y_i||^2 + \mathcal{R}(\theta)\]

<p>where $\mathcal{R}(\theta)$ is a regularization term shown at the end of equations (4) and (5) of Lemma 3.</p>

<p>Now, let us confine the regression problem to predicting the constant zero function $y_i = 0$.</p>

\[\theta = \text{argmin}_ \theta \mathbb{E}_{(x_i, y_i) \sim D} || f_\theta(x_i) + f_{\theta^*}(x_i)||^2 + \mathcal{R}(\theta)\]

<p>Then, the optimization problem is equivalent to distilling a randomly drawn function from the prior. With $f_{\theta^*}$ being the target and $f_\theta$ being the predictor, the distillation error can be seen as a quantification of uncertainty in predicting the constant zero function $y_i = 0$.</p>
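<p>To make the correspondence explicit, the following sketch rewrites the summed function as a distillation error. It ignores the regularizer $\mathcal{R}(\theta)$ and identifies the RND target with $-f_{\theta^*}$; both simplifications are assumptions made here for illustration rather than statements from the paper.</p>

\[|| g_\theta(x) ||^2 = || f_\theta(x) + f_{\theta^*}(x) ||^2 = || f_\theta(x) - (-f_{\theta^*}(x)) ||^2\]

<p>Under these assumptions, the right-hand side has the same form as the RND objective $|| \hat{f}(x;\theta) - f(x) ||^2$, with predictor $\hat{f} = f_\theta$ and target $f = -f_{\theta^*}$. It is small where the training data has pinned down the predictor and can be larger elsewhere.</p>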

<h3 id="section-2-3">2.3 Combining Intrinsic and Extrinsic Returns</h3>

<h4 id="intrinsic-reward-and-non-episodic-environment">Intrinsic Reward and Non-episodic Environment</h4>

<p>When using only intrinsic rewards, the authors treat the problem as non-episodic. In other words, returns are not truncated when the game is over. There are several justifications for this. First, it tells the agent that its intrinsic return should reflect all the novel states it could find in future episodes, not just in the current one. Second, episodic intrinsic returns can leak information about the task to the agent through the game-over signal, so the agent would no longer be trained on intrinsic information alone (Burda et al., 2018).</p>

<figure align="center">
  <img style="margin: 0 auto;" src="/public/images/2021-12-01-exploration-by-random-network-distillation/death-is-not-the-end.png" alt="" />
  <figcaption>From <em>Large-Scale Study of Curiosity-Driven Learning</em> (Burda et al., 2018)</figcaption>
</figure>

<p>The authors argue that this approach is also closer to how humans explore games. Suppose Alice is playing a tricky part of the game where it is easy to fail. If she succeeds, then she will fulfill her curiosity, so the reward is high. If she fails, she has to repeat the “boring” task, so the reward should be small. However, if Alice is modeled as an episodic agent, the return of game over is 0 by definition, which could be a high reward depending on the environment. Thus, in some environments, Alice will be overly risk-averse, not considering the “boredom” from game over.</p>

<p>For empirical results, check <a href="#section-3-1">Section 3.1</a>.</p>

<h4 id="extrinsic-reward-and-episodic-environment">Extrinsic Reward and Episodic Environment</h4>

<p>However, when we use extrinsic rewards, we should use the episodic problem setting. If we use non-episodic returns, the agent could find a strategy to exploit this setting by finding an extrinsic reward close to the beginning of the game and deliberately dying quickly. This can be seen as <strong>reward farming</strong>, a common phenomenon when the reward function is designed inappropriately.</p>

<figure align="center">
  <img style="margin: 0 auto;" src="/public/images/2021-12-01-exploration-by-random-network-distillation/reward-farming.gif" alt="" />
    <figcaption>Agent exploiting <em>Blades of Vengeance</em>. From <a href="https://blog.openai.com/gym-retro/">Gym Retro</a> by OpenAI.</figcaption>
</figure>

<h4 id="combining-intrinsic-and-extrinsic-reward">Combining Intrinsic and Extrinsic Reward</h4>

<p>Intrinsic rewards benefit from a non-episodic setting, while extrinsic rewards benefit from an episodic setting. We want a dense reward signal, so we want to use both intrinsic and extrinsic rewards, but it is nontrivial to estimate the combined return from two streams of rewards.</p>

<p>The authors solve this by fitting two value heads $V_E$ and $V_I$ separately to their respective returns. $V_E$ estimates the cumulative extrinsic reward, while $V_I$ estimates the cumulative intrinsic reward. These two value heads are added to get the value function $V = V_E + V_I$.</p>

<p>Fitting two value heads can have a bonus effect: the extrinsic reward function is stationary, while the intrinsic reward function is non-stationary. If we were to use a single value function $V$, it would need to estimate a non-stationary reward function. However, with two value heads, $V_E$ can focus on the stationary reward function.</p>
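<p>A minimal sketch of how the two return targets could be computed is shown below. It is an illustration under stated assumptions, not the authors' code: the helper name <code>discounted_returns</code>, the toy reward values, and the use of plain discounted returns (rather than any particular advantage estimator) are all hypothetical choices. The extrinsic stream is cut at episode boundaries, while the intrinsic stream ignores them, and each stream may use its own discount factor.</p>

```python
import numpy as np


def discounted_returns(rewards, dones, gamma, episodic):
    """Backward pass computing discounted returns for one reward stream.

    If `episodic` is True, the return is cut at episode boundaries
    (dones[t] == 1 means the episode ended after step t). Otherwise the
    boundary is ignored and the return keeps accumulating across episodes.
    """
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        if episodic and dones[t]:
            running = 0.0  # nothing after a game over counts towards this return
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


extrinsic = np.array([0.0, 1.0, 0.0, 0.0])  # sparse game reward (made up)
intrinsic = np.array([0.3, 0.2, 0.4, 0.1])  # exploration bonuses (made up)
dones = np.array([0, 1, 0, 0])              # game over after step 1

# Regression targets for the two value heads V_E and V_I.
target_e = discounted_returns(extrinsic, dones, gamma=0.999, episodic=True)
target_i = discounted_returns(intrinsic, dones, gamma=0.99, episodic=False)

# The combined value estimate is the sum of the two heads.
combined = target_e + target_i
print(combined)
```

<p>Keeping the two targets separate until the final sum is what lets each head use its own episodic setting and its own discount factor.</p>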

<p>For empirical results, check <a href="#section-3-2">Section 3.2</a>.</p>

<h4 id="separate-value-functions">Separate Value Functions</h4>

<p>The section above discussed fitting two value heads in the context of combining two reward streams with different problem settings. However, the same idea can also be used to combine reward streams with different discount factors $\gamma$.</p>

<p>For empirical results, check <a href="#section-3-3">Section 3.3</a>.</p>

<h2 id="section-3">3 Experiments</h2>

<p>The majority of the experiments in this paper are run on <em>Montezuma’s Revenge</em>. This environment has been found to be the hardest for agents to explore without access to expert demonstrations or underlying emulator states. Two metrics are used: mean episodic return and mean number of rooms found.</p>

<h3 id="section-3-1">3.1 Pure Exploration</h3>

<p>In this section, the authors test the hypothesis in <a href="#section-2-3">Section 2.3</a> that the non-episodic setting is a more natural setting when only the intrinsic reward $i_t$ is used.</p>

<figure align="center" style="width: 50%; margin: 0 auto;">
  <img style="margin: 0 auto;" src="/public/images/2021-12-01-exploration-by-random-network-distillation/figure3.png" alt="" />
  <figcaption>Mean episodic return and mean number of rooms on <em> Montezuma's Revenge </em> when trained without extrinsic reward. From Figure 3 of this paper.</figcaption>
</figure>

<p>Since the agent is using the intrinsic reward only, it is not directly optimizing for either metric. However, to get a high intrinsic reward, the agent needs to find novel states, including finding the key and opening the room with that key. Thus, the agent shows improvement over time in both metrics.</p>

<p>Note that the mean episodic return is not monotonic: it increases sharply for 0.4 billion frames and then decreases slowly. This is because once the agent has learned how to use an item or reach a room, doing so is no longer novel, so the intrinsic reward for such actions is low. Therefore, even though the agent reaches more and more rooms, it collects less and less reward.</p>

<h3 id="section-3-2">3.2 Combining Episodic and Non-episodic Returns</h3>

<p>In the experiment above, episodic and non-episodic settings were compared with agents only trained on intrinsic rewards. A natural experiment to follow would be to compare these settings again using both intrinsic and extrinsic rewards. Extrinsic rewards are fixed to be episodic, and both episodic and non-episodic settings are tested for intrinsic rewards. If both rewards are episodic, it is possible to use a single value head, which the authors also experiment with.</p>

<figure align="center" style="width: 50%; margin: 0 auto;">
  <img style="margin: 0 auto;" src="/public/images/2021-12-01-exploration-by-random-network-distillation/figure6b.png" alt="" />
  <figcaption>Mean episodic return and mean number of rooms on <em> Montezuma's Revenge </em> for different combination strategies of intrinsic and extrinsic rewards using CNN policy. From Figure 6 (b) of this paper.</figcaption>
</figure>

<p>Contrary to the authors’ expectations, using two value heads with non-episodic intrinsic reward and episodic extrinsic reward did not show any benefit over other methods. Nevertheless, the remaining experiments still use two value heads with non-episodic intrinsic rewards.</p>

<p>Similar experiments are performed with RNN policies, but they consistently have worse performance than CNN policies. Check <a href="#section-3-4">Section 3.4</a> below for more details.</p>

<h3 id="section-3-3">3.3 Discount Factors</h3>

<p>Previous state-of-the-art work on <em>Montezuma’s Revenge</em> reported better performance with higher discount factors. A standard discount factor is 0.99, but higher values have shown better performance for algorithms that can handle the resulting instability: a higher discount factor lets the agent look further into the future, which also increases the variance of the return. Therefore, the discount factor is an important hyperparameter that should be tuned, as shown below.</p>

<figure align="center" style="width: 50%; margin: 0 auto;">
  <img style="margin: 0 auto;" src="/public/images/2021-12-01-exploration-by-random-network-distillation/higher_discount_factor.png" alt="" />
  <figcaption>Effect of higher discount factor (&gt; 0.99). Figure 3 (b) from <em>Expert-augmented actor-critic for ViZDoom and Montezuma’s Revenge</em> (Garmulewicz, Michalewski, and Miłoś, 2018).</figcaption>
</figure>
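
<p>As a rough guide to what these values mean, a common rule of thumb (not taken from the paper) is that a discount factor $\gamma$ corresponds to an effective horizon of about $1/(1-\gamma)$ steps, since the discount weights sum to that value:</p>

\[\sum_{k=0}^{\infty} \gamma^k = \frac{1}{1-\gamma}, \qquad \frac{1}{1-0.99} = 100, \qquad \frac{1}{1-0.999} = 1000\]

<p>Under this heuristic, moving from 0.99 to 0.999 stretches the horizon from roughly 100 to roughly 1000 steps.</p>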

<p>Following these previous works highlighting the importance of high discount factors, the authors compare different values of $\gamma_I, \gamma_E$.</p>

<figure align="center" style="width: 50%; margin: 0 auto;">
  <img style="margin: 0 auto;" src="/public/images/2021-12-01-exploration-by-random-network-distillation/figure4.png" alt="" />
  <figcaption>Mean episodic return and mean number of rooms on <em> Montezuma's Revenge </em> for different discount factors. From Figure 4 of this paper.</figcaption>
</figure>

<p>We see that $\gamma_I = 0.99$ and $\gamma_E = 0.999$ yield the best result, with a mean return of 11.5K.</p>

<h3 id="section-3-4">3.4 Recurrence</h3>

<p>Montezuma’s Revenge is a partially observable environment. The observation only includes information about the current room and the number of keys the player has. From the observation, the agent cannot deduce where the keys came from, how many were used, or which doors are open.</p>

<p>To deal with this partial observability, it is possible to reformulate a state as a summary of the past using a recurrent neural network (RNN). This is a similar approach to deep recurrent Q-network (DRQN).</p>

<figure align="center" style="width: 50%; margin: 0 auto;">
  <img style="margin: 0 auto;" src="/public/images/2021-12-01-exploration-by-random-network-distillation/drqn.png" alt="" />
    <figcaption>The DRQN architecture From <em>Deep Q-Learning with Recurrent Neural Networks</em> (Chen et al., 2015)</figcaption>
</figure>

<p>To distinguish the two, the new state formulation is labeled the <strong>RNN policy</strong>, and the old state formulation using just the visual observation is called the <strong>CNN policy</strong>.</p>

<figure align="center" style="width: 80%; margin: 0 auto;">
  <img style="margin: 0 auto;" src="/public/images/2021-12-01-exploration-by-random-network-distillation/figure6.png" alt="" />
    <figcaption>Comparison of mean episodic return and mean number of rooms on <em> Montezuma's Revenge </em> for RNN and CNN policies. From Figure 6 of this paper.</figcaption>
</figure>

<p>To the surprise of the authors, RNN policy results in worse performance compared to CNN policy.</p>

<h3 id="section-3-5">3.5 Scaling Up RNN Training</h3>

<p>In this section, the authors further investigate RNN policies to show the effect of the increased scale of parallel environments. For all experiments in this section, intrinsic rewards are non-episodic, and $\gamma_I = 0.99, \gamma_E = 0.999$.</p>

<p>Agents are tested with 32, 128, 256, and 1024 parallel environments. For a fair comparison, the batch size used to train the predictor network must be held fixed. This is because a larger batch makes the predictor network learn faster, so the intrinsic reward decays more rapidly. Thus, when the number of environments increases from 32 to 128 (4 times), 75% of the elements of the batch are randomly dropped, keeping just 25%. Similarly, when scaling up from 32 to 256 and 1024 environments, only 12.5% and 3.125% of the batch are kept.</p>
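<p>The subsampling arithmetic can be sketched as follows. This is a hypothetical helper, not the authors' code: the function name <code>subsample_for_predictor</code>, the rollout length of 128 steps, and the placeholder observations are assumptions for illustration. It keeps a random fraction of 32 divided by the number of environments, so the predictor always trains on the same number of samples.</p>

```python
import numpy as np


def subsample_for_predictor(batch, num_envs, base_envs=32, rng=None):
    """Keep a random base_envs / num_envs fraction of the predictor's batch."""
    rng = np.random.default_rng() if rng is None else rng
    keep_fraction = min(1.0, base_envs / num_envs)
    num_keep = max(1, int(round(keep_fraction * len(batch))))
    # Sample row indices without replacement so no element is kept twice.
    idx = rng.choice(len(batch), size=num_keep, replace=False)
    return batch[idx]


ROLLOUT_LEN = 128  # illustrative rollout length (an assumption)
for num_envs in (32, 128, 256, 1024):
    batch = np.zeros((num_envs * ROLLOUT_LEN, 4))  # placeholder observations
    kept = subsample_for_predictor(batch, num_envs, rng=np.random.default_rng(0))
    print(num_envs, 32 / num_envs, len(kept))
```

<p>With these illustrative numbers, the kept batch contains 4096 elements for every environment count, so the predictor sees the same amount of data per update regardless of scale.</p>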

<figure align="center" style="width: 50%; margin: 0 auto;">
  <img style="margin: 0 auto;" src="/public/images/2021-12-01-exploration-by-random-network-distillation/figure5.png" alt="" />
    <figcaption>Performance of RNN agents with a different number of parallel environments. From Figure 5 of this paper.</figcaption>
</figure>

<p>As predicted, the agent performs better with more parallel environments. With 1024 environments, the RNN RND agent had a mean episodic return of 10070 with the best return of 14415.</p>

<p>Separately, the authors allowed the RNN RND agent with 32 environments to train for 1.6M parameter updates (1.6B frames). This agent had a mean episodic return of 7570, and <strong>the best run was able to achieve a return of 17500, visiting all 24 rooms and completing the first level.</strong></p>

<figure align="center" style="margin: 0 auto;">
  <img style="margin: 0 auto;" src="/public/images/2021-12-01-exploration-by-random-network-distillation/pyramid.gif" alt="" />
    <figcaption>Mean of RNN RND agents with 32 parallel environments. From <a href="https://blog.openai.com/reinforcement-learning-with-prediction-based-rewards/">Reinforcement Learning with Prediction-Based Rewards</a> by OpenAI.</figcaption>
</figure>

<h3 id="section-3-6">3.6 Comparison to Baselines</h3>

<p>To compare RND with two baselines, the authors also train it on 6 hard exploration Atari 2600 games: Gravitar, Montezuma’s Revenge, Pitfall!, Private Eye, Solaris, and Venture.</p>

<p>The first baseline is the <strong>“vanilla” Proximal Policy Optimization (PPO)</strong> agent, without any exploration bonus.</p>

<p>The second baseline is PPO with a different exploration bonus mechanism based on forward dynamics error. There are numerous works on designing intrinsic rewards with forward dynamics, as described in <a href="#section-4-1">Section 4.1</a>. Among those, the authors select the <strong>Intrinsic Curiosity Module (ICM)</strong>. It is a good representative of prior methods using forward dynamics error.</p>

<p>Furthermore, <a href="https://arxiv.org/abs/1808.04355">Burda et al. (2018)</a> showed that training a forward dynamics model in a <strong>random feature (RF) space</strong> works as well as any other feature space most of the time, so the authors use the RF space instead (ICM-RF). RND and ICM-RF are quite similar, allowing for a direct comparison of algorithms while fixing other parts of the methods such as dual value heads, non-episodic intrinsic returns, normalization schemes, etc.</p>

<figure align="center" style="margin: 0 auto;">
  <img style="margin: 0 auto;" src="/public/images/2021-12-01-exploration-by-random-network-distillation/burda-rf.png" alt="" />
    <figcaption>Different feature spaces experimented for training a forward dynamics model. Figure 2 from <em>Large-Scale Curiosity-Driven Learning</em> (Burda et al., 2018)</figcaption>
</figure>

<figure align="center" style="margin: 0 auto; width: 50%;">
  <img style="margin: 0 auto;" src="/public/images/2021-12-01-exploration-by-random-network-distillation/figure7.png" alt="" />
    <figcaption>Performances of PPO, RND, and ICM-RF (labeled CNN policy, dynamics). Figure 7 from this paper.</figcaption>
</figure>

<figure align="center" style="margin: 0 auto; width: 50%;">
  <img style="margin: 0 auto;" src="/public/images/2021-12-01-exploration-by-random-network-distillation/table5.png" alt="" />
    <figcaption>Performances of PPO, RND, and ICM-RF (labeled DYN CNN). Table 5 from this paper.</figcaption>
</figure>

<p>RND achieves a new state of the art on <em>Gravitar</em> and <em>Montezuma’s Revenge</em> and is competitive with the state of the art on <em>Venture</em>. On <em>Private Eye</em> and <em>Solaris</em>, RND scores below the state of the art but still outperforms PPO and ICM-RF. Like all other methods, RND fails to get a positive score on <em>Pitfall!</em>.</p>

<h3 id="section-3-7">3.7 Qualitative Analysis: Dancing with Skulls</h3>

<p>Observing the RND agent, the authors found that once the agent has collected all the extrinsic rewards it knows how to obtain reliably, it continues to interact with potentially dangerous objects. For instance, in <em>Montezuma’s Revenge</em>, the agent jumps back and forth over a moving skull that costs the agent a life on contact. Similarly, in <em>Pitfall!</em>, the agent repeatedly “dances” with the rope and the scorpion.</p>

<!-- <figure>
    <div style="display: grid;">
        <div style="grid-row: 1; grid-column: 1;">
            <video muted controls>
                <source src="/public/images/2021-12-01-exploration-by-random-network-distillation/pitfall_rope_dance.mp4" type="video/mp4"/>
            </video>
        </div>
        <div style="grid-row: 1; grid-column: 2;">
            <video controls>
                <source src="/public/images/2021-12-01-exploration-by-random-network-distillation/pitfall_scorpion_dance.mp4" type="video/mp4"/>
            </video>
        </div>
    </div>
<figcaption>RND agent videos from ICLR 2019 submission</figcaption>
</figure> -->

<p>The authors speculate that the agent adopts this behavior because such dangerous states are difficult to reach and to survive in. They are therefore rarely represented in the agent’s past experience compared to safer states, so they remain novel and keep yielding intrinsic reward.</p>

<p>The videos can be found in <a href="https://drive.google.com/drive/folders/15q5RnbK6qPWLr-ifzZBY1HlP1BWiEPCj?usp=sharing">this Google Drive folder</a> shared by the authors.</p>

<h2 id="section-4">4 Related Work</h2>

<h3 id="section-4-1">4.1 Exploration</h3>

<p>To encourage exploration, the intrinsic reward $i_t$ should be designed so that it is higher in novel states than in frequently visited states. If the environment is simple enough that the states and their visitation counts can be represented by a table, we can tally the number of visits to each state. For example, if the environment is a 5x5 grid, we only need to keep track of 25 numbers. In such tabular cases, we can define $i_t$ as a decreasing function of the visitation count $n(s)$. These are called <strong>count-based exploration methods</strong>. Two common choices are</p>

\[i_t = \frac{\beta}{n(s)} \quad \text{or} \quad i_t = \frac{\beta}{\sqrt{n(s)}}\]

<p>where $\beta$ is an optional coefficient to tune exploration. However, most interesting environments are much more complex. For example, if the state space is the real line and an agent that starts at a random number can move left or right by any distance, most states will be visited at most once. In such non-tabular cases, it is difficult to define a visitation count. A possible generalization is to define a <strong>pseudo-count</strong> $N(s)$ derived from state density estimates and use it in place of $n(s)$ in the exploration bonus. With density estimates, even a state that has never been visited has a positive pseudo-count if it is similar to previously visited states.</p>

<p>Another way to design the intrinsic reward $i_t$ is to define it as a <strong>prediction error</strong> for a problem related to the agent’s transitions. <strong>Dynamics prediction methods</strong> are exploration methods that predict the environment dynamics and use the prediction error to define the exploration bonus. Simply using the raw prediction error makes the agent susceptible to the “noisy-TV” problem in a stochastic or partially observable environment, so alternative metrics, such as the improvement in prediction, are used instead.</p>
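<p>As a rough illustration of the tabular case, the count-based bonus $\beta / \sqrt{n(s)}$ can be sketched in a few lines of Python. The class name <code>CountBonus</code>, the default $\beta = 1$, and the choice to count the current visit before computing the bonus are assumptions made for this example, not details from the paper.</p>

<pre><code class="language-python">from collections import defaultdict
import math


class CountBonus:
    """Tabular count-based exploration bonus: i_t = beta / sqrt(n(s))."""

    def __init__(self, beta=1.0):
        self.beta = beta                 # exploration coefficient (assumed value)
        self.counts = defaultdict(int)   # n(s), keyed by a hashable state

    def __call__(self, state):
        # Count the current visit first so that n(s) is at least 1.
        self.counts[state] += 1
        return self.beta / math.sqrt(self.counts[state])


bonus = CountBonus(beta=1.0)
first = bonus((0, 0))        # first visit to cell (0, 0): n(s) = 1
for _ in range(3):
    fourth = bonus((0, 0))   # after the loop, n(s) = 4
novel = bonus((4, 4))        # a different cell, visited once
</code></pre>

<p>Here the bonus for cell (0, 0) shrinks from 1.0 on the first visit to 0.5 on the fourth, while the newly visited cell (4, 4) again receives the full bonus of 1.0. The table grows with the number of distinct states, which is why this approach does not carry over to large or continuous state spaces.</p>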

<figure align="center">
  <img style="margin: 0 auto;" src="/public/images/2021-12-01-exploration-by-random-network-distillation/schmidhuber91.png" alt="Schmidhuber" />
    <figcaption>Curiosity proposed by Schmidhuber in 1991. From <em>A Possibility for Implementing Curiosity and Boredom in Model-Building Neural Networks</em> (Schmidhuber, 1991)</figcaption>
</figure>

<p>The most relevant example is the <strong>Intrinsic Curiosity Module</strong> (Pathak et al., 2017; Burda et al., 2018). The Intrinsic Curiosity Module (ICM) trains a forward model whose output $\hat{\phi}(s_{t+1})$ attempts to predict the encoded next state $\phi(s_{t+1})$ given the encoded state $\phi(s_t)$ and action $a_t$. The intrinsic reward $i_t$ is defined as the prediction error of the forward model. The forward model is trained as the agent explores the environment, so a low prediction error means that the ICM has already learned the transition from $(s_t, a_t)$ and the transition is no longer novel.</p>

<figure align="center">
  <img style="margin: 0 auto;" src="/public/images/2021-12-01-exploration-by-random-network-distillation/icm.png" alt="" />
  <figcaption>Diagram illustrating the ICM algorithm. From <em>Curiosity-driven Exploration by Self-supervised Prediction</em> (Pathak et al., 2017)</figcaption>
</figure>
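<p>The forward-model bonus can be sketched as follows. This is a minimal toy version under stated assumptions, not the ICM implementation: the encoder $\phi$ is a fixed random projection (in the spirit of the random feature space used by ICM-RF), the forward model is linear rather than a neural network, and the class name <code>ForwardCuriosity</code>, the dimensions, and the learning rate are all hypothetical.</p>

<pre><code class="language-python">import numpy as np


class ForwardCuriosity:
    """Forward-dynamics bonus in a fixed random feature space (toy sketch)."""

    def __init__(self, obs_dim, n_actions, feat_dim=16, lr=0.05, seed=0):
        rng = np.random.default_rng(seed)
        # Fixed random encoder phi; it is never trained.
        self.w_phi = rng.normal(size=(obs_dim, feat_dim)) / np.sqrt(obs_dim)
        # Linear forward model (an assumption; ICM uses a neural network).
        self.w_fwd = np.zeros((feat_dim + n_actions, feat_dim))
        self.n_actions = n_actions
        self.lr = lr

    def phi(self, s):
        return np.tanh(s @ self.w_phi)

    def _error(self, s, a, s_next):
        # Model input: encoded state concatenated with a one-hot action.
        x = np.concatenate([self.phi(s), np.eye(self.n_actions)[a]])
        return x, x @ self.w_fwd - self.phi(s_next)

    def reward(self, s, a, s_next):
        # Intrinsic reward: mean squared prediction error of the forward model.
        _, err = self._error(s, a, s_next)
        return float(np.mean(err ** 2))

    def update(self, s, a, s_next):
        # One SGD step on 0.5 * ||err||^2 with respect to w_fwd.
        x, err = self._error(s, a, s_next)
        self.w_fwd -= self.lr * np.outer(x, err)


rng = np.random.default_rng(1)
s, s_next = rng.normal(size=8), rng.normal(size=8)
icm = ForwardCuriosity(obs_dim=8, n_actions=4)
before = icm.reward(s, 2, s_next)
for _ in range(50):
    icm.update(s, 2, s_next)
after = icm.reward(s, 2, s_next)
</code></pre>

<p>After the model has been trained repeatedly on the same transition, <code>after</code> is smaller than <code>before</code>: the transition has become predictable and no longer earns much bonus. If the next state were stochastic, the error could not be driven to zero, which is the source of the noisy-TV problem mentioned above.</p>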

<p>Other exploration methods include adversarial self-play, empowerment maximization, parameter noise injection, option discovery, and ensembles.</p>

<h3 id="section-4-2">4.2 Montezuma’s Revenge</h3>

<p>Commonly known as one of the hardest problems of <em>Atari 2600</em> since the birth of <a href="https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf">Deep Q-Networks (DQN; Mnih et al., 2015)</a>, <em>Montezuma’s Revenge</em> has been a standard benchmark for exploration algorithms.</p>

<p><strong>Without any explicit exploration bonus</strong>, early deep reinforcement learning algorithms such as DQN failed to make meaningful progress. However, in 2018, <a href="https://arxiv.org/abs/1803.00933">Ape-X (Horgan et al., 2018)</a>, <a href="https://arxiv.org/abs/1802.01561">IMPALA (Espeholt et al., 2018)</a>, and <a href="https://arxiv.org/abs/1806.05635">Self-Imitation Learning (SIL; Oh et al., 2018)</a> showed that even without such a bonus, it is possible to achieve a score of 2500.</p>

<p>Using the <strong>pseudo-count exploration bonus</strong> discussed above allowed for new state-of-the-art performance, as shown by <a href="https://arxiv.org/abs/1606.01868">DQN-CTS (Bellemare et al., 2016)</a> and <a href="https://arxiv.org/abs/1703.01310">DQN-PixelCNN (Ostrovski et al., 2017)</a>.</p>

<p>Some have also improved exploration by <strong>using the internal RAM state</strong> of the emulator to hand-craft exploration bonuses. Despite such access, these methods still scored below an average human.</p>

<p><strong>Expert demonstrations</strong> have been used to simplify the exploration problem. With this information, multiple methods such as atari-reset achieved superhuman performance. However, learning from expert demonstrations exploits the deterministic nature of the environment. To prevent the agent from simply memorizing the expert’s sequence of actions, newer methods have been tested on a stochastic variant of the game with <strong>sticky actions</strong> (with some probability, the previous action is repeated instead of the newly chosen one).</p>
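<p>Sticky actions can be sketched as a small filter placed between the agent and the environment. The class name <code>StickyActions</code> and its interface are hypothetical; the repeat probability of 0.25 is a commonly used setting and is assumed here for illustration.</p>

<pre><code class="language-python">import random


class StickyActions:
    """With probability p, repeat the previous action instead of the new one."""

    def __init__(self, p=0.25, seed=0):
        self.p = p                       # repeat probability (assumed value)
        self.rng = random.Random(seed)
        self.prev = None                 # no previous action at episode start

    def reset(self):
        self.prev = None

    def __call__(self, action):
        if self.prev is not None:
            # Keep the previous action with probability p.
            action = self.rng.choices(
                [self.prev, action], weights=[self.p, 1.0 - self.p]
            )[0]
        self.prev = action
        return action


always_sticky = StickyActions(p=1.0)
stuck = [always_sticky(a) for a in [3, 5, 1]]    # every action becomes 3
never_sticky = StickyActions(p=0.0)
passed = [never_sticky(a) for a in [3, 5, 1]]    # actions pass through unchanged
</code></pre>

<p>Because the executed action can differ from the chosen one at random, replaying a fixed sequence of expert actions no longer leads to the same trajectory, so the agent has to learn a policy that reacts to the state it actually ends up in.</p>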

<h3 id="section-4-3">4.3 Random Features</h3>

<p>Using the features of a randomly initialized neural network has been extensively studied in the context of supervised learning. It has also recently been used in reinforcement learning as an exploration technique by <a href="https://arxiv.org/abs/1806.03335">Osband et al. (2018)</a> and <a href="https://arxiv.org/abs/1808.04355">Burda et al. (2018)</a>. RND itself was motivated by Osband et al., as shown in <a href="#section-2-2">Section 2.2</a>, where the authors use a lemma from that paper. The work by Burda et al. was used as a baseline in <a href="#section-3-6">Section 3.6</a>.</p>

<h3 id="section-4-4">4.4 Vectorized Value Functions</h3>

<p>The idea of a vectorized value function was used in <a href="https://arxiv.org/abs/1802.09081">Temporal Difference Models (TDM; Pong et al., 2018)</a> and <a href="https://arxiv.org/abs/1707.06887">C51 (Bellemare et al., 2017)</a>.</p>

<h2 id="section-5">5 Discussion</h2>

<p>RND was able to use directed exploration to achieve high performance in Atari games despite its simplicity. This suggests that when applied at scale, even simple exploration methods can solve hard exploration games. The results also suggest that methods that can treat intrinsic and extrinsic rewards separately can benefit from such flexibility.</p>
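<p>To make the idea of treating the two reward streams separately concrete, the sketch below computes discounted returns for extrinsic and intrinsic rewards independently, with the extrinsic stream treated as episodic and the intrinsic stream as non-episodic. The function name, the toy rewards, and the discount of 0.5 (chosen so the numbers are easy to check by hand) are assumptions for this example only.</p>

<pre><code class="language-python">import numpy as np


def discounted_returns(rewards, dones, gamma, episodic):
    """Backward pass computing discounted returns for one reward stream."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        if episodic and dones[t]:
            running = 0.0          # do not carry returns across episode ends
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


ext_rewards = [0.0, 0.0, 1.0, 0.0, 1.0]
int_rewards = [1.0, 1.0, 1.0, 1.0, 1.0]
dones = [False, False, True, False, True]   # episodes end at steps 2 and 4

# Extrinsic stream: episodic, so returns reset at each episode boundary.
ret_ext = discounted_returns(ext_rewards, dones, gamma=0.5, episodic=True)
# Intrinsic stream: non-episodic, so returns flow across episode boundaries.
ret_int = discounted_returns(int_rewards, dones, gamma=0.5, episodic=False)
</code></pre>

<p>With these inputs, <code>ret_ext</code> is [0.25, 0.5, 1.0, 0.5, 1.0] and <code>ret_int</code> is [1.9375, 1.875, 1.75, 1.5, 1.0]. Keeping the streams apart is what allows each to have its own discount factor and its own notion of episode termination before the two are combined.</p>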

<p>RND is enough to deal with <strong>local exploration</strong>: exploring the consequences of short-term decisions, like choosing to interact with or avoid a particular object. However, the authors note that RND does not perform <strong>global exploration</strong>, which involves coordinating decisions over long time horizons.</p>

<p>To understand global exploration, let us consider <em>Montezuma’s Revenge</em>. The RND agent is good at exploring short-term decisions: it can choose to use or avoid the ladder, key, skull, or other objects. However, <em>Montezuma’s Revenge</em> requires more than these local explorations. In the first level of <em>Montezuma’s Revenge</em>, there are four keys and six locked doors spread throughout the level. Any key can open any door, but the key is consumed in the process. To solve the first level, the agent must enter a room locked behind two doors, so it must save two keys by leaving two other, easier-to-find doors closed, even though it would be rewarded for opening them. This requires <strong>global exploration</strong> through long-term planning.</p>

<p>How can we convince the agent to adopt such behavior? Since not opening the other two doors results in a loss of extrinsic reward, the agent should receive enough intrinsic reward to compensate for that loss. The authors suspect that the RND agent does not get enough incentive through intrinsic rewards to try this strategy, and thus it rarely manages to finish the level.</p>

<h2 id="final-thoughts">Final Thoughts</h2>

<p><strong>Questions</strong></p>

<ul>
  <li>The authors argue that RND trivializes the noisy-TV problem. However, can’t “dancing with skulls” be thought of as a variant of the noisy-TV problem?</li>
  <li>In simulated environments, “dancing with skulls” is just an interesting observation. However, if the agent is deployed in real life, we would like the agent to stay away from danger (for example, a robot deployed in a firefighting operation). Are there methods to discourage such behavior after training has finished? Can they coexist with exploration methods?</li>
  <li>In <a href="#section-5">Section 5</a>, the authors argue that RND shows how dividing intrinsic and extrinsic rewards could benefit the agent. However, a single value head seems to do just as well as two value heads (Figure 6 in <a href="#section-3-2">Section 3.2</a>). Does most of this benefit come from being able to fine-tune the discount factor (Figure 4)?</li>
</ul>

<p><strong>Recommended Next Papers</strong></p>

<ul>
  <li><a href="https://arxiv.org/abs/1810.02274">Episodic Curiosity through Reachability (Savinov et al., 2018)</a>: This paper approaches exploration in a different way to solve the problem of global exploration discussed in <a href="#section-5">Section 5</a>.</li>
  <li><a href="https://arxiv.org/abs/1901.10995">Go-Explore: a New Approach for Hard-Exploration Problems (Ecoffet et al., 2019)</a>: This paper exploits the simulator being resettable and deterministic to repeatedly explore “promising states.”</li>
  <li><a href="https://arxiv.org/abs/2002.06038">Never Give Up: Learning Directed Exploration Strategies (Badia et al., 2020)</a>: This paper uses RND as a long-term novelty module to combine with another episodic novelty module. This algorithm was integrated into <a href="https://arxiv.org/abs/2003.13350">Agent57</a> which achieved superhuman performance on all 57 Atari environments.</li>
</ul>

</div>

<div id="bibtex-container" class="related">
  For attribution in academic contexts, please cite this work as
  <pre id="bibtex-academic-attribution">

  </pre>

  BibTeX citation
  <pre id="bibtex-box">

  </pre>
</div>
<script>
  let authorsSpan = document.getElementById("iclr-post-authors");
  let authorsText = authorsSpan.textContent;
  let lnameFnameInstitution = authorsText.split(";");
  let lfiList = lnameFnameInstitution.map(lfi => lfi.split(",").map(item => item.trim()));
  let bibtexLFI = lfiList.map(lfi => lfi[0] + ", " + lfi[1]).join(" and ")
  let academicLFI = lfiList.map(lfi => lfi[0]);
  {
    if(academicLFI.length > 2) academicLFI = academicLFI[0] + ", et al.";
    else if(academicLFI.length == 2) academicLFI = academicLFI[0] + " & " + academicLFI[1];
    else academicLFI = academicLFI[0];
  }

  let titleSpan = document.getElementById("iclr-post-title");
  let titleText = titleSpan.textContent.trim();
  let bibtexTitleShorthand = (lfiList[0][1]+
    "2022"+
    titleText.split(" ").slice(0, 3).join("")
  ).replace(" ", "").replace(/[\p{P}$+<=>^`|~]/gu, '').toLowerCase().trim();

  let bibtexTemplate = `
@inproceedings{${bibtexTitleShorthand},
  author = {${bibtexLFI}},
  title = {${titleText}},
  booktitle = {ICLR Blog Track},
  year = {2022},
  note = {${window.location.href}},
  url  = {${window.location.href}}
}
  `.trim();
  document.getElementById("bibtex-box").innerText = bibtexTemplate;

  let academicTemplate = `
${academicLFI}, "${titleText}", ICLR Blog Track, 2022.
`.trim();
  document.getElementById("bibtex-academic-attribution").innerText = academicTemplate;

</script>




<script src="https://utteranc.es/client.js"
        repo="iclr-blog-track/iclr-blog-track.github.io"
        issue-term="pathname"
        label="utterance"
        theme="boxy-light"
        crossorigin="anonymous"
        >
</script>


      </div>
    </div>

    <label for="sidebar-checkbox" class="sidebar-toggle"></label>

    <script src='/public/js/script.js'></script>
  </body>
</html>
