Keywords: RL, Reinforcement Learning, Replay Buffer
Abstract: Experience replay is an essential component in off-policy model-free reinforcement learning (MfRL). Due to its effectiveness, various methods for calculating priority scores on experiences have been proposed for sampling. Since critic networks are crucial to policy learning, TD-error, directly correlated to $Q$-values, is one of the most frequently used features to compute the scores. However, critic networks often under- or overestimate $Q$-values, so it is often ineffective to learn to predict $Q$-values by sampled experiences based heavily on TD-error. Accordingly, it is valuable to find auxiliary features, which positively support TD-error in calculating the scores for efficient sampling. Motivated by this, we propose a novel experience replay method, which we call model-augmented prioritized experience replay (MaPER), that employs new learnable features driven from components in model-based RL (MbRL) to calculate the scores on experiences. The proposed MaPER brings the effect of curriculum learning for predicting $Q$-values better by the critic network with negligible memory and computational overhead compared to the vanilla PER. Indeed, our experimental results on various tasks demonstrate that MaPER can significantly improve the performance of the state-of-the-art off-policy MfRL and MbRL which includes off-policy MfRL algorithms in its policy optimization procedure.
One-sentence Summary: We propose a novel experience replay which employs additional auxiliary learnable features as well as TD-errors for prioritizing experiences
Supplementary Material: zip