Provably Efficient Reward Transfer in Reinforcement Learning with Discrete Markov Decision Processes

TMLR Paper6217 Authors

15 Oct 2025 (modified: 24 Nov 2025) · Under review for TMLR · CC BY 4.0
Abstract: In this paper, we propose a new solution to reward adaptation (RA) in reinforcement learning, where the agent adapts to a target reward function based on one or more existing source behaviors learned a priori under the same domain dynamics but different reward functions. While learning the target behavior from scratch is possible, it is often inefficient given the available source behaviors. Our work introduces a new approach to RA through the manipulation of Q-functions. Assuming the target reward function is a known function of the source reward functions, we compute bounds on the Q-function and present an iterative process (akin to value iteration) to tighten these bounds. Such bounds enable action pruning in the target domain before learning even starts. We refer to this method as "Q-Manipulation" (Q-M). The iterative process assumes access to a lite-model, which is easy to provide or learn. We formally prove that Q-M, under discrete domains, does not affect the optimality of the returned policy and show that it is provably efficient in terms of sample complexity in a probabilistic sense. Q-M is evaluated in a variety of synthetic and simulation domains to demonstrate its effectiveness, generalizability, and practicality.
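To make the bound-tightening idea concrete, below is a minimal sketch of what such an iteration might look like in a tabular MDP. This is not the paper's implementation: the names (`q_manipulation_prune`, `P`, `r_lo`, `r_hi`) and the specific initialization are illustrative assumptions; here `P` stands in for the lite-model and `r_lo`/`r_hi` for reward bounds derived from the known combination of source reward functions.

```python
import numpy as np

def q_manipulation_prune(P, r_lo, r_hi, gamma=0.9, n_iter=100):
    """Sketch: tighten upper/lower Q-bounds, then prune dominated actions.

    P     : (S, A, S) transition probabilities (stand-in for the lite-model)
    r_lo  : (S, A) lower bound on the target reward
    r_hi  : (S, A) upper bound on the target reward
    """
    S, A, _ = P.shape
    q_lo = np.full((S, A), r_lo.min() / (1 - gamma))  # pessimistic init
    q_hi = np.full((S, A), r_hi.max() / (1 - gamma))  # optimistic init

    # Value-iteration-style backups that tighten both bounds.
    for _ in range(n_iter):
        v_lo = q_lo.max(axis=1)          # best achievable lower value per state
        v_hi = q_hi.max(axis=1)          # best achievable upper value per state
        q_lo = r_lo + gamma * P @ v_lo   # Bellman backup on the lower bound
        q_hi = r_hi + gamma * P @ v_hi   # Bellman backup on the upper bound

    # Prune action a at state s if its upper bound is dominated by the best
    # lower bound at s: such an action can never be optimal there.
    keep = q_hi >= q_lo.max(axis=1, keepdims=True)
    return keep  # boolean (S, A) mask of actions worth exploring
```

The returned mask could then restrict the action set during target-domain learning, which is the sense in which pruning happens "before learning even starts"; the paper's actual bounds and update rules should be taken from the main text.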
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission:
1. **Clarified theoretical assumptions in the body of Theorem 3.7**
2. **Additional results added to the appendix**
   * B.1 Action Pruning Analysis
   * B.2 Scalability to Larger MDP
   * B.3 Pruning Threshold Ablation
   * B.4 Noisy Source Reward Analysis
   * B.5 Overhead and Target Training Running Time Analysis
3. **Added discussion of the effect of noisy source rewards on Q-M and M-Q-M**

All modifications are highlighted in blue.
Assigned Action Editor: ~Pablo_Samuel_Castro1
Submission Number: 6217