- Abstract: To use deep reinforcement learning in the wild, we might hope for an agent that can avoid catastrophic mistakes. Unfortunately, even in simple environments, the popular deep Q-network (DQN) algorithm is doomed by a Sisyphean curse. Owing to the use of function approximation, these agents eventually forget experiences as they become exceedingly unlikely under a new policy. Consequently, for as long as they continue to train, DQNs may periodically relive catastrophic mistakes. Many real-world environments where people might be injured exhibit a special structure. We know a priori that catastrophes are not only bad, but that agents need not ever get near to a catastrophe state. In this paper, we exploit this structure to learn a reward-shaping that accelerates learning and guards oscillating policies against repeated catastrophes. First, we demonstrate unacceptable performance of DQNs on two toy problems. We then introduce intrinsic fear, a new method that mitigates these problems by avoiding dangerous states. Our approach incorporates a second model trained via supervised learning to predict the probability of catastrophe within a short number of steps. This score then acts to penalize the Q-learning objective, shaping the reward function away from catastrophic states.
- TL;DR: Owing to function approximation, DRL agents eventually forget about dangerous transitions once they learn to avoid them, putting them at risk of perpetually repeating mistakes. We propose techniques to avert catastrophic outcomes.
- Keywords: Deep learning, Reinforcement Learning, Applications
- Conflicts: cs.ucsd.edu, microsoft.com