Bootstrapped Hindsight Experience Replay with Counterintuitive Prioritization

Published: 28 Jan 2022, Last Modified: 13 Feb 2023; ICLR 2022 Submitted; Readers: Everyone
Keywords: Reinforcement learning, hindsight experience replay, counterintuitive prioritization
Abstract: Goal-conditioned environments are typically sparse-reward tasks, in which the agent receives a positive reward only when it achieves the goal. Such a setting makes it difficult for the agent to discover successful trajectories. Hindsight experience replay (HER) replaces the goal in failed experiences with any goal that was actually achieved, so that the agent has a much higher chance of seeing successful trajectories, even if they are fake. Comprehensive results in the literature have demonstrated the effectiveness of HER. However, the importance of the fake trajectories differs in terms of exploration and exploitation, and it is usually inefficient to learn with a fixed proportion of fake and original data, as HER does. In this paper, inspired by Bootstrapped DQN, we use multiple heads in DDPG and take advantage of the diversity and uncertainty among the heads to improve data efficiency with relabeled goals. We refer to the method as Bootstrapped HER (BHER). Specifically, in addition to the benefit of bootstrapping, we explicitly leverage the uncertainty measured by the variance of the Q-values estimated by the multiple heads. It is commonly believed that higher uncertainty promotes exploration, and hence that maximizing uncertainty via a bonus term yields better performance in Q-learning. However, in this paper we reveal a counterintuitive conclusion: for hindsight experiences, exploiting lower-uncertainty data samples significantly improves performance. The explanation is that hindsight relabeling already largely promotes exploration, so exploiting lower-uncertainty data (whose goals are generated by hindsight relabeling) provides a good trade-off between exploration and exploitation, resulting in further improved data efficiency. Comprehensive experiments demonstrate that our method achieves state-of-the-art results on many goal-conditioned tasks.
One-sentence Summary: We reveal a counterintuitive conclusion: for hindsight experiences, exploiting lower-uncertainty data samples significantly improves performance.
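The core idea, measuring uncertainty as the variance of Q-values across heads and favoring low-uncertainty samples, can be illustrated with a minimal sketch. The abstract does not state the exact sampling rule, so the softmax over negative variance, the `temperature` parameter, and the function names `head_variance` and `low_uncertainty_probs` are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def head_variance(q_values):
    """Per-sample variance of Q-value estimates across heads.

    q_values has shape (num_heads, batch_size); the result has shape
    (batch_size,). A larger variance means the heads disagree more.
    """
    return np.var(np.asarray(q_values, dtype=float), axis=0)


def low_uncertainty_probs(variances, temperature=1.0):
    """Sampling probabilities that favor LOW-variance samples.

    Assumption: a softmax over negative variance. The paper may use a
    different rule (for example, rank-based prioritization).
    """
    logits = -np.asarray(variances, dtype=float) / temperature
    logits = logits - logits.max()  # shift for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()


# Toy example: 3 heads scoring 3 relabeled transitions.
q = np.array([
    [1.0, 2.0, 0.5],
    [1.1, 0.0, 0.5],
    [0.9, 4.0, 0.5],
])
variances = head_variance(q)
probs = low_uncertainty_probs(variances)

# Draw a prioritized minibatch of indices from the toy buffer.
rng = np.random.default_rng(0)
batch_idx = rng.choice(len(probs), size=4, p=probs)
```

In this toy case the heads agree exactly on the third transition and disagree most on the second, so the third receives the highest sampling probability and the second the lowest. A lower `temperature` sharpens that preference; a higher one moves sampling toward uniform.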
Supplementary Material: zip
