The authors speculate that the agent adopts this behavior because such dangerous states are difficult to reach or survive in, and are therefore rarely represented in the agent's past experience compared to safer states.
The videos can be found through this Google Drive folder link shared by the authors.
To encourage exploration, the intrinsic reward $i_t$ should be designed so that it is higher in novel states than in frequently visited states. If the environment were simple enough that states and their visitation counts could be represented by a table, we could simply tally the number of visits to each state. For example, if the environment were a 5x5 grid, we would only need to keep track of 25 numbers. In such tabular cases, we can define $i_t$ as a decreasing function of the visitation count $n(s)$. These are called count-based exploration methods.
\[i_t = \frac{\beta}{n(s)} \quad \text{or} \quad i_t = \frac{\beta}{\sqrt{n(s)}}\]where $\beta$ is a coefficient that tunes the scale of exploration. However, most interesting environments are much more complex. For example, if the state space were the real line and an agent starting at a random number could move left or right by any distance, most states would be visited at most once. In such non-tabular cases, it is difficult to define a visitation count. A possible generalization is the pseudo-count $N(s)$, derived from a learned state density model, which can replace the true visitation count in the exploration bonus. With density estimates, even states that have never been visited receive a positive pseudo-count if they are similar to previously visited states.
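As a concrete illustration of the tabular case, here is a minimal sketch of a count-based bonus using the $\beta/\sqrt{n(s)}$ form. The `CountBasedBonus` class and its names are illustrative, not taken from any particular implementation.

```python
import math
from collections import defaultdict

class CountBasedBonus:
    """Tabular count-based exploration bonus: i_t = beta / sqrt(n(s))."""

    def __init__(self, beta=0.1):
        self.beta = beta
        self.counts = defaultdict(int)  # n(s) for each discrete (hashable) state

    def bonus(self, state):
        # Update the visitation count for this state, then return the bonus.
        self.counts[state] += 1
        return self.beta / math.sqrt(self.counts[state])

# Example: states of a 5x5 grid world represented as (row, col) tuples.
explorer = CountBasedBonus(beta=0.1)
print(explorer.bonus((0, 0)))  # first visit -> large bonus
print(explorer.bonus((0, 0)))  # repeated visits -> bonus decays
```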
Another way to design the intrinsic reward $i_t$ is to define it as the prediction error on a problem related to the agent's transitions. Dynamics prediction methods predict the environment dynamics and use the prediction error as the exploration bonus. Naively using the prediction error makes the agent susceptible to the "noisy-TV" problem in stochastic or partially observable environments, so alternative metrics such as prediction improvement are sometimes used instead.
The most relevant example is the Intrinsic Curiosity Module (ICM; Pathak et al., 2017; Burda et al., 2018). ICM trains a forward model that outputs a prediction $\hat{\phi}(s_{t+1})$ of the encoded next state $\phi(s_{t+1})$ given the encoded state $\phi(s_t)$ and the action $a_t$. The intrinsic reward $i_t$ is defined as the prediction error of this forward model. Because the forward model is trained as the agent explores the environment, a low prediction error means that ICM has already learned the transition $(s_t, a_t)$.
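Below is a minimal sketch of such a forward-model bonus in the spirit of ICM, assuming PyTorch. The encoder, forward model, and dimensions are illustrative placeholders rather than the exact architecture from Pathak et al.

```python
import torch
import torch.nn as nn

class ForwardModelBonus(nn.Module):
    """Intrinsic reward = || f(phi(s_t), a_t) - phi(s_{t+1}) ||^2."""

    def __init__(self, obs_dim, n_actions, feat_dim=64):
        super().__init__()
        # phi: encodes raw observations into a feature space.
        self.encoder = nn.Sequential(nn.Linear(obs_dim, feat_dim), nn.ReLU())
        # f: predicts the next feature from the current feature and action.
        self.forward_model = nn.Sequential(
            nn.Linear(feat_dim + n_actions, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim),
        )
        self.n_actions = n_actions

    def forward(self, obs, action, next_obs):
        # action is a batch of integer action indices (LongTensor).
        phi = self.encoder(obs)
        phi_next = self.encoder(next_obs)
        a_onehot = torch.nn.functional.one_hot(action, self.n_actions).float()
        phi_next_pred = self.forward_model(torch.cat([phi, a_onehot], dim=-1))
        # Per-transition prediction error, used as the intrinsic reward;
        # averaging it over a batch also gives the forward-model training loss.
        return ((phi_next_pred - phi_next.detach()) ** 2).mean(dim=-1)
```

In practice the (detached) prediction error is added to the extrinsic reward, while the same quantity, averaged over a batch, is backpropagated to train the forward model.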
Other exploration methods include adversarial self-play, empowerment maximization, parameter noise injection, option discovery, and ensembles.
Commonly regarded as one of the hardest Atari 2600 games since the introduction of Deep Q-Networks (DQN; Mnih et al., 2015), Montezuma’s Revenge has been a standard benchmark for exploration algorithms.
Without any explicit exploration bonus, early deep reinforcement learning algorithms such as DQN failed to make meaningful progress. However, in 2018, Ape-X (Horgan et al., 2018), IMPALA (Espeholt et al., 2018), and Self-Imitation Learning (SIL; Oh et al., 2018) showed that even without such a bonus, it is possible to achieve a score of 2500.
Using the pseudo-count exploration bonus discussed above allowed for new state-of-the-art performance, as shown by DQN-CTS (Bellemare et al., 2016) and DQN-PixelCNN (Ostrovski et al., 2017).
Some works have also improved exploration by using the internal RAM state of the emulator to hand-craft exploration bonuses. Despite such privileged access, these methods still scored below the average human.
Expert demonstrations have also been used to simplify the exploration problem, and multiple methods such as atari-reset achieved superhuman performance with them. However, learning from expert demonstrations exploits the deterministic nature of the environment. To prevent the agent from simply memorizing the expert’s sequence of actions, newer methods have been evaluated on a stochastic variant with sticky actions, where each action is repeated with some probability.
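As an illustration, sticky actions can be implemented with a simple environment wrapper. This is a generic sketch assuming the Gym API and a repeat probability of 0.25, not the exact evaluation protocol used in the paper.

```python
import random
import gym

class StickyActions(gym.Wrapper):
    """With probability p, repeat the previous action instead of the chosen one."""

    def __init__(self, env, p=0.25):
        super().__init__(env)
        self.p = p
        self.last_action = 0

    def reset(self, **kwargs):
        self.last_action = 0
        return self.env.reset(**kwargs)

    def step(self, action):
        # Sticky actions: occasionally ignore the agent's choice and
        # repeat whatever was executed on the previous step.
        if random.random() < self.p:
            action = self.last_action
        self.last_action = action
        return self.env.step(action)
```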
Using the features of a randomly initialized neural network has been extensively studied in the context of supervised learning. It has also recently been used as an exploration technique in reinforcement learning by Osband et al. (2018) and Burda et al. (2018). RND was partly motivated by Osband et al.: as shown in Section 2.2, the authors use a lemma from that work. The work by Burda et al. was used as a baseline in Section 3.6.
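To make the idea concrete, here is a minimal sketch of how the features of a fixed, randomly initialized target network can define an exploration bonus via a trained predictor, in the spirit of RND. It assumes PyTorch, and the network sizes and names are illustrative, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class RandomNetworkBonus(nn.Module):
    """Intrinsic reward = || predictor(s) - target(s) ||^2, with a frozen random target."""

    def __init__(self, obs_dim, feat_dim=64):
        super().__init__()
        self.target = nn.Sequential(nn.Linear(obs_dim, feat_dim), nn.ReLU(),
                                    nn.Linear(feat_dim, feat_dim))
        self.predictor = nn.Sequential(nn.Linear(obs_dim, feat_dim), nn.ReLU(),
                                       nn.Linear(feat_dim, feat_dim))
        # The target network keeps its random initialization; only the predictor is trained.
        for p in self.target.parameters():
            p.requires_grad_(False)

    def forward(self, obs):
        target_feat = self.target(obs)
        pred_feat = self.predictor(obs)
        # The error is large on rarely seen states and shrinks as the predictor
        # is trained on the states the agent actually visits.
        return ((pred_feat - target_feat) ** 2).mean(dim=-1)
```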
The idea of a vectorized value function was used in Temporal Difference Models (TDM; Pong et al., 2018) and C51 (Bellemare et al., 2017).
RND was able to use directed exploration to achieve high performance in Atari games despite its simplicity. This suggests that when applied at scale, even simple exploration methods can solve hard exploration games. The results also suggest that methods that can treat intrinsic and extrinsic rewards separately can benefit from such flexibility.
RND is sufficient for local exploration: exploring the consequences of short-term decisions, such as whether to interact with or avoid a particular object. However, the authors note that RND does not achieve global exploration, which involves coordinating decisions over long time horizons.
To understand global exploration, consider Montezuma’s Revenge. The RND agent is good at exploring short-term decisions: it can choose to use or avoid the ladder, key, skull, or other objects. However, Montezuma’s Revenge requires more than such local exploration. In the first level, there are four keys and six locked doors spread throughout the level. Any key can open any door, but the key is consumed in the process. To solve the first level, the agent must enter a room locked behind two doors, so it must forgo opening two of the doors that are easier to find, even though it would be rewarded for opening them. This requires global exploration through long-term planning.
How can we convince the agent to adopt such behavior? Since forgoing those doors results in a loss of extrinsic reward, the agent must receive enough intrinsic reward to compensate for that loss. The authors suspect that the RND agent does not get enough incentive through intrinsic rewards to try this strategy, and thus it rarely manages to finish the level.
Questions
Recommended Next Papers