Model-Based Offline RL with Online Adaptation
- Abstract: Model-based offline reinforcement learning (RL) algorithms have shown promise in learning effective policies from diverse static datasets without direct environment interaction, and in enabling the learned policy to generalize to behaviors beyond those shown in the offline data. However, prior model-based offline RL methods typically require per-task hyperparameter tuning by rolling out the learned policy in the real environment. As a result, these algorithms are brittle with respect to hyperparameters and demand a large number of expensive and potentially unsafe online evaluation rollouts, defying the goal of learning robust and safe policies fully offline. To address this issue, we propose to cast the model-based offline single-task RL problem as an offline meta-RL problem: we specify a wide range of hyperparameter candidates and treat the model-generated data under each hyperparameter setting as a separate task. We then run a standard meta-RL algorithm on this multi-task dataset to meta-learn a generalizable context variable corresponding to each hyperparameter setting. At test time, we perform a small number of online rollouts and use the offline meta-RL agent to infer the optimal context, on which the policy is conditioned. Our method thus automatically selects the optimal hyperparameter with only a few online trials, bypassing the need for per-task hyperparameter selection. We evaluate our approach on standard offline RL benchmarks as well as domains that require generalization, and find that it outperforms prior model-based offline RL algorithms with uniform hyperparameters while performing competitively with per-task hyperparameter tuning.
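The test-time adaptation step described above can be sketched in a few lines. This is a minimal illustrative toy, not the paper's implementation: the hyperparameter names, the fixed context vectors, and the `rollout_return` stand-in for executing the context-conditioned policy online are all hypothetical assumptions.

```python
import random

random.seed(0)

# Hypothetical setup: each hyperparameter candidate (here, a model-based
# penalty weight) was treated as a "task" during offline meta-training,
# yielding one learned context vector per setting.
CONTEXTS = {
    "penalty=0.5": [0.1, 0.9],
    "penalty=1.0": [0.4, 0.6],
    "penalty=5.0": [0.8, 0.2],
}

def rollout_return(context, noise=0.05):
    """Stand-in for one online rollout of the context-conditioned policy.
    This toy 'environment' simply rewards balanced context vectors."""
    quality = 1.0 - abs(context[0] - context[1])
    return quality + random.uniform(-noise, noise)

def select_context(contexts, n_rollouts=3):
    """Average the return of a few online rollouts per candidate context,
    then commit to the best one (the test-time adaptation step)."""
    scores = {
        name: sum(rollout_return(c) for _ in range(n_rollouts)) / n_rollouts
        for name, c in contexts.items()
    }
    return max(scores, key=scores.get)

best = select_context(CONTEXTS)
print(best)  # the context whose rollouts scored highest
```

The key design point the sketch mirrors is that only `len(CONTEXTS) * n_rollouts` real-environment rollouts are needed, instead of a full per-task tuning sweep for each hyperparameter setting.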