Provably Efficient Reward Transfer in Reinforcement Learning with Discrete Markov Decision Processes
Abstract: In this paper, we propose a new solution to reward adaptation (RA) in reinforcement learning, where the agent adapts to a target reward function based on one or more source behaviors learned a priori under the same domain dynamics but different reward functions. While learning the target behavior from scratch is possible, it is often inefficient given the available source behaviors. Our work introduces a new approach to RA through the manipulation of Q-functions. Assuming the target reward function is a known function of the source reward functions, we compute bounds on the Q-function and present an iterative process (akin to value iteration) to tighten these bounds. The iteration process is based on a lite-model, which is assumed to be given or learnable. The computed bounds enable action pruning in the target domain before learning even starts. We refer to this method as "Q-Manipulation" (Q-M). We formally prove that Q-M, in discrete domains and with an accurate lite-model, does not affect the optimality of the returned policy, and we show that it is provably efficient in terms of sample complexity in a probabilistic sense. Q-M is evaluated in a variety of synthetic and simulation domains to demonstrate its effectiveness, generalizability, and practicality.
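To make the bound-tightening and pruning idea in the abstract concrete, here is a minimal illustrative sketch, not the paper's exact algorithm: given a tabular MDP model `P` and elementwise reward bounds `R_lo`/`R_hi` (standing in for the bounds derived from the source behaviors; all names here are hypothetical), it runs value-iteration-style updates on lower and upper Q-bounds and then prunes any action whose upper bound falls below the best lower bound at that state.

```python
import numpy as np

def q_bound_pruning(P, R_lo, R_hi, gamma=0.9, iters=100):
    """Illustrative sketch of bound-based action pruning (assumed setup,
    not the paper's exact Q-M procedure).

    P    : (S, A, S) transition probabilities
    R_lo : (S, A) lower bound on the target reward
    R_hi : (S, A) upper bound on the target reward
    Returns a boolean (S, A) mask of actions that survive pruning,
    plus the converged Q-bounds.
    """
    S, A = R_lo.shape
    Q_lo = np.zeros((S, A))
    Q_hi = np.zeros((S, A))
    for _ in range(iters):
        V_lo = Q_lo.max(axis=1)           # lower bound on optimal value
        V_hi = Q_hi.max(axis=1)           # upper bound on optimal value
        Q_lo = R_lo + gamma * (P @ V_lo)  # Bellman-style update on bounds
        Q_hi = R_hi + gamma * (P @ V_hi)
    # Prune action a at state s if its upper bound cannot beat the
    # best lower bound at s: Q_hi[s, a] < max_a' Q_lo[s, a'].
    keep = Q_hi >= Q_lo.max(axis=1, keepdims=True)
    return keep, Q_lo, Q_hi

# Tiny 2-state, 2-action example with self-loop dynamics: at state 0,
# action 1's best-case return provably cannot beat action 0's
# worst-case return, so it is pruned before any learning.
P = np.zeros((2, 2, 2))
for s in range(2):
    P[s, :, s] = 1.0                      # every action self-loops
R_lo = np.array([[1.0, 0.0], [0.5, 0.5]])
R_hi = np.array([[1.0, 0.5], [0.5, 0.5]])
keep, Q_lo, Q_hi = q_bound_pruning(P, R_lo, R_hi)
```

In this toy instance `keep[0]` is `[True, False]`: the dominated action at state 0 is eliminated, while both actions survive at state 1, where the bounds cannot separate them; this mirrors the abstract's claim that pruning happens before learning starts and never removes an optimal action when the bounds are valid.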
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Pablo_Samuel_Castro1
Submission Number: 6217