A Unifying View of Linear Function Approximation in Off-Policy RL Through Matrix Splitting and Preconditioning

Zechen Wu; Amy Greenwald; Ronald Parr

A Unifying View of Linear Function Approximation in Off-Policy RL Through Matrix Splitting and Preconditioning

Zechen Wu, Amy Greenwald, Ronald Parr

Published: 18 Sept 2025, Last Modified: 29 Oct 2025NeurIPS 2025 spotlightEveryoneRevisionsBibTeXCC BY-SA 4.0

Keywords: Policy Evaluation, Temporal Difference Learning, Fitted Q-iteration, Target Network, Offline RL, Linear Function Approximation, Matrix Splitting

Abstract: In off-policy policy evaluation (OPE) tasks within reinforcement learning, Temporal Difference Learning(TD) and Fitted Q-Iteration (FQI) have traditionally been viewed as differing in the number of updates toward the target value function: TD makes one update, FQI makes an infinite number, and Partial Fitted Q-Iteration (PFQI) performs a finite number. We show that this view is not accurate, and provide a new mathematical perspective under linear value function approximation that unifies these methods as a single iterative method solving same linear system, but using different matrix splitting schemes and preconditioners. We show that increasing the number of updates under the same target value function, i.e., the target network technique, is a transition from using a constant preconditioner to using a data-feature adaptive preconditioner. This elucidates, for the first time, why TD convergence does not necessarily imply FQI convergence, and establishes tight convergence connections among TD, PFQI, and FQI. Our framework enables sharper theoretical results than previous work and characterization of the convergence conditions for each algorithm, without relying on assumptions about the features (e.g., linear independence). We also provide an encoder-decoder perspective to better understand TD’s convergence conditions, and prove, for the first time, that when a large learning rate doesn’t work, trying a smaller one may help(for batch TD). Our framework also leads to the discovery of new crucial conditions on features for convergence, and shows how common assumptions about features influence convergence, e.g., the assumption of linearly independent features can be dropped without compromising the convergence guarantees of stochastic TD in the on-policy setting. This paper is also the first to introduce matrix splitting into the convergence analysis of these algorithms.

Primary Area: Reinforcement learning (e.g., decision and control, planning, hierarchical RL, robotics)

Submission Number: 22351

Loading