Is Value Learning Really the Main Bottleneck in Offline RL?

Published: 19 Jun 2024, Last Modified: 26 Jul 2024 · ARLET 2024 Oral · CC BY 4.0
Keywords: offline reinforcement learning
Abstract: While imitation learning requires access to high-quality data, offline reinforcement learning (RL) should, in principle, perform comparably or better with substantially lower data quality. However, current results indicate that offline RL often performs worse than imitation learning, and it is often unclear what holds back its performance. In this work, we aim to understand the bottlenecks in current offline RL algorithms. While the weaker performance of offline RL is typically attributed to an imperfect value function, we ask: *is the main bottleneck of offline RL indeed in learning the value function, the policy, or something else?* To answer this question, we perform a systematic empirical study of (1) value learning, (2) policy extraction, and (3) policy generalization in offline RL problems through the lens of the “data-scaling” properties of each component, analyzing how these components affect performance. We make two surprising observations. First, the choice of policy extraction algorithm significantly affects the performance and scalability of offline RL, often more so than the underlying value learning objective. For instance, widely used value-weighted regression objectives (e.g., AWR) are often unable to fully leverage the learned value function, and switching to behavior-regularized policy gradient objectives (e.g., DDPG+BC) often leads to substantial improvements in performance and scaling behavior. Second, the suboptimal performance of offline RL is often due to imperfect policy generalization on test-time states outside the support of the training data, rather than imperfect policy accuracy on in-distribution states. While most current offline RL algorithms do not explicitly address this, we show that the use of suboptimal but high-coverage data or on-the-fly policy extraction techniques can be effective in addressing the policy generalization issue in practice.
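
The two policy extraction objectives contrasted in the abstract can be illustrated with a minimal sketch. The PyTorch code below assumes hypothetical interfaces (`GaussianPolicy`, `q_fn`, `v_fn` are illustrative names, not the paper's code): AWR performs behavioral cloning on dataset actions reweighted by exponentiated advantages, while DDPG+BC directly maximizes the learned Q-value at the policy's own actions with a behavioral-cloning regularizer.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Illustrative Gaussian policy with a state-independent log-std."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, act_dim)
        )
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def dist(self, obs):
        return torch.distributions.Normal(self.net(obs), self.log_std.exp())

def awr_loss(policy, q_fn, v_fn, obs, act, temperature=3.0):
    # AWR-style extraction: behavioral cloning on dataset actions,
    # weighted by exp(advantage) computed from the learned value functions.
    with torch.no_grad():
        adv = q_fn(obs, act) - v_fn(obs)                  # (batch,)
        weights = torch.clamp(torch.exp(temperature * adv), max=100.0)
    log_prob = policy.dist(obs).log_prob(act).sum(-1)     # (batch,)
    return -(weights * log_prob).mean()

def ddpg_bc_loss(policy, q_fn, obs, act, alpha=1.0):
    # DDPG+BC-style extraction: maximize Q at the policy's own
    # (reparameterized) actions, plus a BC term toward dataset actions.
    dist = policy.dist(obs)
    pi_act = dist.rsample()
    return -(q_fn(obs, pi_act) + alpha * dist.log_prob(act).sum(-1)).mean()
```

The structural difference is that the DDPG+BC objective queries and backpropagates through Q at policy-proposed actions, whereas AWR only evaluates the value function at dataset actions, which is one way to see why it may fail to fully leverage the learned value function.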
Submission Number: 18