2021 (modified: 24 Feb 2022), COLT 2021
Abstract: We study the regret of reinforcement learning from offline data generated by a fixed behavior policy in an infinite-horizon discounted Markov decision process (MDP). While existing analyses of comm...
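The setting in the abstract can be illustrated with a minimal sketch. Every specific below is an assumption made for illustration and is not taken from the paper. The sketch uses a hypothetical two-state, two-action MDP with made-up numbers, and it draws transitions i.i.d., with states sampled uniformly and actions taken from a fixed behavior policy. Regret is taken as the value gap $\mathbb{E}_{s\sim\rho}[V^*(s) - V^{\hat\pi}(s)]$ under an initial distribution $\rho$. A simple plug-in (certainty-equivalence) learner stands in for whatever estimator the paper analyzes.

```python
import random

# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
# P[s][a] = next-state probabilities; R[s][a] = expected reward.
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.7, 0.3], [0.05, 0.95]]]
R = [[0.0, 0.0],
     [1.0, 0.5]]
GAMMA = 0.9
N_S, N_A = 2, 2

def q_value_iteration(P, R, gamma, iters=500):
    """Approximate Q* by repeated Bellman optimality updates."""
    Q = [[0.0] * N_A for _ in range(N_S)]
    for _ in range(iters):
        V = [max(Q[s]) for s in range(N_S)]
        Q = [[R[s][a] + gamma * sum(P[s][a][t] * V[t] for t in range(N_S))
              for a in range(N_A)] for s in range(N_S)]
    return Q

def policy_value(P, R, gamma, pi, iters=500):
    """Evaluate a deterministic policy pi (list mapping state -> action)."""
    V = [0.0] * N_S
    for _ in range(iters):
        V = [R[s][pi[s]] + gamma * sum(P[s][pi[s]][t] * V[t] for t in range(N_S))
             for s in range(N_S)]
    return V

def collect_offline_data(n, behavior, rng):
    """Sample n transitions (s, a, r, s') using a fixed behavior policy.
    Assumption: states are drawn i.i.d. uniformly, not along a trajectory."""
    data = []
    for _ in range(n):
        s = rng.randrange(N_S)
        a = rng.choices(range(N_A), weights=behavior[s])[0]
        s_next = rng.choices(range(N_S), weights=P[s][a])[0]
        data.append((s, a, R[s][a], s_next))
    return data

def plug_in_policy(data, gamma):
    """Certainty equivalence: fit an empirical model, act greedily on its Q*."""
    counts = [[[0] * N_S for _ in range(N_A)] for _ in range(N_S)]
    rew_sum = [[0.0] * N_A for _ in range(N_S)]
    for s, a, r, s_next in data:
        counts[s][a][s_next] += 1
        rew_sum[s][a] += r
    P_hat = [[None] * N_A for _ in range(N_S)]
    R_hat = [[0.0] * N_A for _ in range(N_S)]
    for s in range(N_S):
        for a in range(N_A):
            n = sum(counts[s][a])
            # Unvisited (s, a): fall back to a uniform model (an assumption).
            P_hat[s][a] = [c / n for c in counts[s][a]] if n else [1.0 / N_S] * N_S
            R_hat[s][a] = rew_sum[s][a] / n if n else 0.0
    Q_hat = q_value_iteration(P_hat, R_hat, gamma)
    return [max(range(N_A), key=lambda a: Q_hat[s][a]) for s in range(N_S)]

def regret(pi, rho, gamma):
    """Value gap between the optimal policy and pi, averaged over rho."""
    Q_star = q_value_iteration(P, R, gamma)
    V_star = [max(Q_star[s]) for s in range(N_S)]
    V_pi = policy_value(P, R, gamma, pi)
    return sum(rho[s] * (V_star[s] - V_pi[s]) for s in range(N_S))

# Usage: learn from 5000 offline transitions of a uniform behavior policy.
rng = random.Random(0)
behavior = [[0.5, 0.5], [0.5, 0.5]]
data = collect_offline_data(5000, behavior, rng)
pi_hat = plug_in_policy(data, GAMMA)
print(pi_hat, regret(pi_hat, [0.5, 0.5], GAMMA))
```

Regret defined this way is nonnegative for any policy. In this toy MDP the optimal policy is `[1, 0]`, so its regret is zero. The plug-in learner is only a baseline chosen for readability. How fast its regret shrinks as the dataset grows is exactly the kind of question the abstract raises.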