Double Policy Estimation for Importance Sampling in Sequence Modeling-Based Reinforcement Learning

Hanhan Zhou; Tian Lan; Vaneet Aggarwal

Published: 07 Nov 2023, Last Modified: 17 Nov 2023, Venue: FMDM@NeurIPS2023

Keywords: Offline Reinforcement Learning, Decision Transformer

Abstract: Offline reinforcement learning aims to utilize datasets of previously gathered environment-action interaction records to learn a policy without access to the real environment. Recent work has shown that offline reinforcement learning can be formulated as a sequence modeling problem and solved via supervised learning with approaches such as Decision Transformer. While these sequence-based methods achieve competitive results over return-to-go methods, especially on tasks that require longer episodes or have sparse rewards, importance sampling is not considered to correct the policy bias when dealing with off-policy data, mainly due to the absence of a behavior policy and the use of deterministic evaluation policies. To this end, we propose an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation (DPE) in a unified framework with statistically proven properties on variance reduction. We validate our method on multiple OpenAI Gym tasks with D4RL benchmarks. Our method brings performance improvements on selected methods and outperforms state-of-the-art baselines in several tasks, demonstrating the advantages of enabling double policy estimation for sequence-modeled reinforcement learning.

Submission Number: 15
