Abstract: We study the off-policy value prediction problem in reinforcement learning, where the value function of a target policy is estimated from sample trajectories generated by a behaviour policy. Importance-sampling-based methods are the typical go-to approach for obtaining such estimates, but they tend to suffer high error in long-horizon problems because they correct only single-step discrepancies and fail to address steady-state bias, i.e., the skewed state visitation induced by the behaviour policy. In this paper, we present an algorithm that alleviates this bias in off-policy value prediction with linear function approximation by correcting the discrepancy between state visitation distributions. We establish rigorous theoretical guarantees, proving asymptotic convergence under Markov noise with ergodicity and showing that the spectral properties of the corrected update matrix ensure stability. Most significantly, we derive an error decomposition showing that the total estimation error is bounded by a constant multiple of the best achievable approximation error within the function class, where this constant depends transparently on the quality of the distribution estimate and on the feature design. Empirical evaluation across multiple benchmark domains demonstrates that our method effectively mitigates steady-state bias and is a viable alternative to existing methods in scenarios where distributional shift is critical.
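To make the idea concrete, below is a minimal illustrative sketch (not the paper's specific algorithm) of distribution-corrected off-policy TD(0) with linear features on a toy random-walk MDP. The state-visitation ratios `w(s) = d_pi(s) / d_mu(s)` are assumed to be supplied by some external estimator and are replaced here by a placeholder of ones; the toy dynamics, policies, and step size are hypothetical choices made only for this example.

```python
import numpy as np

# Illustrative sketch: distribution-corrected off-policy TD(0) with linear
# features on a toy 5-state random walk. This is NOT the paper's algorithm;
# the visitation ratios `w` are a placeholder for an external estimator.

rng = np.random.default_rng(0)

n_states, n_actions, gamma = 5, 2, 0.9
phi = np.eye(n_states)                      # one-hot features phi(s)

# Behaviour policy mu and target policy pi (action probabilities per state).
mu = np.full((n_states, n_actions), 0.5)
pi = np.tile(np.array([0.8, 0.2]), (n_states, 1))

def step(s, a):
    """Toy dynamics: action 0 moves left, action 1 moves right (clipped)."""
    s_next = int(np.clip(s + (1 if a == 1 else -1), 0, n_states - 1))
    reward = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, reward

# Hypothetical state-visitation ratios w(s) = d_pi(s) / d_mu(s).
# A real method would estimate these; here they are set to ones.
w = np.ones(n_states)

theta = np.zeros(n_states)                  # V(s) is approximated by theta @ phi(s)
alpha, s = 0.05, 0

for t in range(20000):
    a = rng.choice(n_actions, p=mu[s])      # act with the behaviour policy
    s_next, r = step(s, a)
    rho = pi[s, a] / mu[s, a]               # per-step importance ratio
    td_error = r + gamma * theta @ phi[s_next] - theta @ phi[s]
    # Distribution-corrected off-policy TD update.
    theta += alpha * w[s] * rho * td_error * phi[s]
    s = s_next

print("Estimated V(s):", np.round(theta, 3))
```

With an accurate estimate of `w`, updates of this form reweight the behaviour-policy state distribution toward the target policy's, which is the kind of steady-state correction the abstract refers to.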
Submission Length: Long submission (more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=I34YY15aDN&referrer=%5Bthe%20profile%20of%20Ajin%20George%20Joseph%5D(%2Fprofile%3Fid%3D~Ajin_George_Joseph3)
Changes Since Last Submission: The previous submission had a TMLR style-file formatting error; this has now been fixed.
Assigned Action Editor: ~Bo_Dai1
Submission Number: 5632