Abstract: Offline reinforcement learning (RL) is challenged by the distributional shift problem. To tackle this issue, existing works mainly focus on designing sophisticated policy constraints between the learned policy and the behavior policy. However, these constraints are applied equally to well-performing and inferior actions through uniform sampling, which might negatively affect the learned policy. In this article, we propose offline decoupled prioritized resampling (ODPR), which designs specialized priority functions to address the suboptimal policy constraint issue in offline RL and employs a distinct decoupled resampling scheme for training stability. Through theoretical analysis, we show that the proposed priority functions induce a provably improved behavior policy by modifying the distribution of the original behavior policy, and that when constrained to this improved policy, a policy-constrained offline RL algorithm is likely to yield a better solution. We provide two practical implementations to balance computation and performance: one estimates priorities with a fitted value network [advantage-based ODPR (ODPR-A)] and the other uses trajectory returns [return-based ODPR (ODPR-R)] for faster computation. As a highly compatible plug-and-play component, ODPR is evaluated with five prevalent offline RL algorithms: behavior cloning (BC), twin delayed deep deterministic policy gradient + BC (TD3 + BC), OnestepRL, conservative Q-learning (CQL), and implicit Q-learning (IQL). Our experiments confirm that both ODPR-A and ODPR-R significantly improve performance across all baseline methods. Moreover, ODPR-A remains effective in challenging settings where trajectory information is unavailable. Code and pretrained weights are available at https://github.com/yueyang130/ODPR.
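To make the resampling idea concrete, below is a minimal Python sketch of return-based prioritized resampling in the spirit of ODPR-R: each trajectory's return is mapped to a sampling priority, and transitions are drawn in proportion to their trajectory's priority instead of uniformly. The function names, the exponential priority formula, and the temperature parameter are illustrative assumptions, not the paper's exact implementation.

```python
# Illustrative sketch of return-based prioritized resampling (ODPR-R flavor).
# The priority formula and temperature are assumptions for illustration only.
import numpy as np

def trajectory_return_priorities(trajectory_returns, temperature=1.0):
    """Map each trajectory's return to a normalized sampling priority."""
    returns = np.asarray(trajectory_returns, dtype=np.float64)
    # Normalize returns to [0, 1] so priorities are scale-invariant.
    span = returns.max() - returns.min()
    normalized = (returns - returns.min()) / (span + 1e-8)
    # Higher-return trajectories get higher priority; the temperature
    # controls how sharply sampling concentrates on them.
    priorities = np.exp(normalized / temperature)
    return priorities / priorities.sum()

def sample_batch(transitions, transition_traj_ids, traj_probs, batch_size, rng):
    """Sample transitions with probability proportional to their trajectory's priority."""
    per_transition_probs = traj_probs[transition_traj_ids]
    per_transition_probs = per_transition_probs / per_transition_probs.sum()
    idx = rng.choice(len(transitions), size=batch_size, p=per_transition_probs)
    return [transitions[i] for i in idx]

# Usage: resample a toy dataset of 3 trajectories with 10 transitions each.
rng = np.random.default_rng(0)
traj_returns = [10.0, 50.0, 100.0]
probs = trajectory_return_priorities(traj_returns, temperature=0.5)
transitions = list(range(30))                     # 30 dummy transitions
traj_ids = np.repeat(np.arange(3), 10)            # trajectory id per transition
batch = sample_batch(transitions, traj_ids, probs, batch_size=8, rng=rng)
```

Any policy-constrained algorithm trained on batches drawn this way is effectively constrained to a reweighted, higher-quality behavior distribution; ODPR-A follows the same pattern but derives priorities from advantages estimated by a fitted value network rather than from trajectory returns.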