Offline Workflows for Offline Reinforcement Learning Algorithms
- Abstract: Offline reinforcement learning (RL) methods promise to convert large and diverse datasets into effective control policies without active online interaction. However, despite recent algorithmic improvements in offline RL, applying these methods to real-world problems has proven challenging. Current offline RL methods are sensitive to hyperparameters, and tuning these hyperparameters requires online rollouts or some proxy "validation" task, both of which can be difficult to obtain. Even off-policy evaluation (OPE) methods specifically designed for model selection suffer from similar hyperparameter tuning challenges. In this paper, we devise a principled approach for offline tuning of offline RL algorithms that does not require explicit OPE. We focus on the class of conservative offline RL algorithms that optimize a weighted combination of the RL objective and a distributional shift constraint, where this weight is typically an important hyperparameter. Analogously to the highly effective workflow in supervised learning, which utilizes a held-out validation set, our approach provides a recipe for tuning hyperparameters using a combination of validation sets and various metrics that we recommend tracking over the course of training. Theoretically, we select the hyperparameter that maximizes a lower bound on the policy value, use the validation set in a theoretically sound way, and provide a new algorithmic modification that is more amenable to hyperparameter tuning.
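To make the tuned quantity concrete, the sketch below illustrates the kind of weighted objective the abstract describes: a Bellman-error term plus a distributional-shift penalty scaled by a weight alpha, which is the hyperparameter being selected. This is a hypothetical simplification for illustration only (the function name, inputs, and exact penalty form are assumptions, not the paper's formulation).

```python
import numpy as np

def conservative_loss(q_data, q_ood, td_error, alpha):
    """Toy conservative offline RL objective (illustrative sketch, not
    the paper's exact method).

    q_data:   Q-values on actions present in the offline dataset
    q_ood:    Q-values on out-of-distribution (learned-policy) actions
    td_error: temporal-difference errors on dataset transitions
    alpha:    trade-off weight -- the hyperparameter being tuned
    """
    # Standard RL objective term: mean squared Bellman error.
    bellman = np.mean(td_error ** 2)
    # Distributional-shift penalty: push down Q-values on unseen actions
    # relative to dataset actions, discouraging over-estimation.
    shift_penalty = np.mean(q_ood) - np.mean(q_data)
    return bellman + alpha * shift_penalty
```

Sweeping alpha trades off fitting the data against conservatism; the workflow described above is about choosing this weight offline rather than via online rollouts.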