Transfer Q-learning for finite-horizon Markov decision processes

Published: 28 Oct 2025, Last Modified: 28 Jan 2026. Electronic Journal of Statistics. License: CC BY 4.0.
Abstract: Time-inhomogeneous finite-horizon Markov decision processes (MDPs) are frequently used to model decision-making in dynamic treatment regimes and other statistical reinforcement learning (RL) settings. Application areas such as healthcare and business often face high-dimensional state spaces and time-inhomogeneity of the MDP, compounded by insufficient sample sizes that complicate informed decision-making. To overcome these challenges, we investigate knowledge transfer within time-inhomogeneous finite-horizon MDPs by leveraging data from both a target RL task and several related source tasks. We develop transfer learning (TL) algorithms suitable for both batch and online Q-learning that integrate information from offline source studies. The proposed transfer Q-learning algorithm contains a novel re-targeting step that enables cross-stage transfer across the multiple stages of an RL task, in addition to the usual cross-task transfer familiar from supervised learning. We establish the first theoretical justifications of TL in RL tasks by showing a faster rate of convergence for Q-function estimation in offline RL transfer, and a lower regret bound in offline-to-online RL transfer, under stage-wise reward similarity and mild design similarity across tasks. Empirical evidence from both synthetic and real datasets evaluates the proposed algorithm and supports our theoretical results.
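The abstract describes batch transfer Q-learning with a re-targeting step for cross-stage transfer. The sketch below is not the paper's actual algorithm; it is an illustrative toy implementation of the general idea under stated assumptions: linear Q-functions, a hypothetical two-stage MDP with binary actions, and a common two-step transfer scheme (pool source and target data for a shared estimate, then debias with a target-only correction). The "re-targeting" here is realized as the usual fitted-Q Bellman backup that plugs the next-stage estimate into each stage's regression target; all function names (`simulate`, `transfer_fitted_q`) and dynamics are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, H = 3, 2  # state dimension, horizon

def phi(s, a):
    """Per-action linear features: shape (n, 2*d), one block per action."""
    f = np.zeros((s.shape[0], 2 * d))
    idx = (a == 1)
    f[~idx, :d] = s[~idx]
    f[idx, d:] = s[idx]
    return f

def ridge(X, y, lam=1e-2):
    """Ridge least squares: solves (X'X + lam*I) w = X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def simulate(n, w_stages, shift=0.0):
    """Batch data from a toy 2-stage MDP; `shift` perturbs the reward
    parameters to create a related (source) task."""
    data, s = [], rng.normal(size=(n, d))
    for h in range(H):
        a = rng.integers(0, 2, size=n)
        X = phi(s, a)
        r = X @ (w_stages[h] + shift) + 0.1 * rng.normal(size=n)
        s_next = 0.5 * s + rng.normal(scale=0.5, size=(n, d))
        data.append((s, a, r, s_next))
        s = s_next
    return data

def transfer_fitted_q(source_batches, target_batch):
    """Backward-induction fitted Q-learning with a two-step transfer at
    every stage: (1) pool all source + target data for a shared estimate,
    (2) debias it with a target-only correction fit on the residuals."""
    w_hat = [None] * H
    for h in reversed(range(H)):
        def stage_xy(batch):
            s, a, r, s_next = batch[h]
            if h == H - 1:
                y = r
            else:
                # re-target: plug the next-stage estimate into the Bellman backup
                q_next = np.stack(
                    [phi(s_next, np.full(len(s_next), act)) @ w_hat[h + 1]
                     for act in (0, 1)])
                y = r + q_next.max(axis=0)
            return phi(s, a), y
        Xs, ys = zip(*[stage_xy(b) for b in source_batches + [target_batch]])
        w_pool = ridge(np.vstack(Xs), np.concatenate(ys))
        Xt, yt = stage_xy(target_batch)
        # heavier shrinkage on the correction, reflecting scarce target data
        delta = ridge(Xt, yt - Xt @ w_pool, lam=1.0)
        w_hat[h] = w_pool + delta
    return w_hat

# demo: three related source tasks plus a small target batch
w_true = [np.ones(2 * d), -np.ones(2 * d)]
source = [simulate(200, w_true, shift=0.1) for _ in range(3)]
target = simulate(40, w_true)
w_hat = transfer_fitted_q(source, target)
```

When the source tasks are similar to the target (small `shift`), the pooled fit contributes the bulk of the signal and the small target batch only needs to estimate a low-magnitude correction, which is the intuition behind the faster convergence rates the abstract refers to.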