Critic Identifiability in Offline Reinforcement Learning with a Deterministic Exploration Policy

TMLR Paper895 Authors

24 Feb 2023 (modified: 31 May 2023), Rejected by TMLR
Abstract: Offline Reinforcement Learning (RL) promises to enable the adoption of RL in settings where logged interaction data is abundant but running live experiments is costly or impossible. The setting where data was gathered with a stochastic exploration policy has been extensively studied; in practice, however, log data is often generated by a deterministic policy. In this work, we examine this deterministic offline RL setting from both theoretical and practical perspectives. We describe the critic identifiability problem from a theoretical standpoint, arguing that algorithms designed for stochastic exploration are ostensibly unsuited for the deterministic version of the problem. We elucidate the problem further using a set of experiments on contextual bandits as well as continuous control problems. We conclude that, quite surprisingly, the tools for stochastic offline RL, notably the TD3+BC algorithm, are applicable after all.
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: N/A
Assigned Action Editor: ~Amir-massoud_Farahmand1
Submission Number: 895