Keywords: deep reinforcement learning, off-policy deep reinforcement learning, safe reinforcement learning
TL;DR: We identify, explain, and remediate the tendency of popular off-policy deep reinforcement learning algorithms to struggle in stochastic environments where the reward function includes negatively correlated terms.
Abstract: The most popular approaches to off-policy deep reinforcement learning (DRL) with continuous action spaces include policy improvement steps in which a learned state-action value ($Q$) function is maximized over selected batches of data. These algorithms also bootstrap the $Q$-function in its own update target and, as a result, tend to overestimate $Q$-values. To combat this overestimation, they take a minimum over multiple $Q$ estimates in the value update target. We examine a setting, common in real-world applications, where improperly balancing these opposing sources of bias can have disastrous consequences: stochastic environments with reward functions composed of multiple negatively correlated terms. Reward terms corresponding to conflicting objectives are negatively correlated; in expectation, gains in one are accompanied by losses in the other. We find that standard approaches consistently fail to approach optimal performance on a suite of robotic tasks in this category. We trace the failure to erroneous $Q$ estimation and propose a novel off-policy actor-critic algorithm that remediates the problem through the use of a policy gradient. Our algorithm significantly outperforms baseline approaches across such tasks, drastically reducing the total cost incurred by the agent throughout training.
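To make the mechanism the abstract critiques concrete, below is a minimal sketch of the standard "minimum over multiple $Q$ estimates" update target (as used in clipped double-Q methods such as TD3/SAC), not the paper's proposed algorithm. The names `q1_target`, `q2_target`, `policy`, and `gamma` are illustrative assumptions, not identifiers from the paper.

```python
# Minimal sketch of a clipped double-Q bootstrapped target: the TD target is
# built from the minimum of two target Q networks to counteract the
# overestimation that arises from bootstrapping the Q-function in its own
# update target. Illustrative only; not the paper's proposed method.
import torch

def clipped_double_q_target(reward, next_state, done,
                            q1_target, q2_target, policy, gamma=0.99):
    """Return r + gamma * (1 - done) * min(Q1(s', a'), Q2(s', a')) with a' = pi(s')."""
    with torch.no_grad():
        next_action = policy(next_state)             # a' from the (target) policy
        q1 = q1_target(next_state, next_action)      # first Q estimate
        q2 = q2_target(next_state, next_action)      # second Q estimate
        min_q = torch.min(q1, q2)                    # pessimistic combination
        return reward + gamma * (1.0 - done) * min_q
```

The abstract's argument is that this pessimistic target, while effective against overestimation in general, can be improperly balanced against the maximization bias in stochastic environments whose rewards contain negatively correlated terms.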
Submission Number: 11