Keywords: reinforcement learning, off-policy reinforcement learning, off-policy evaluation, deep reinforcement learning
Abstract: In this work, we analyze the effectiveness of the Bellman equation as a proxy objective for value prediction accuracy in off-policy evaluation. While the Bellman equation is uniquely solved by the true value function over all state-action pairs, we show that in the finite data regime, the Bellman equation can be satisfied exactly by infinitely many suboptimal solutions. This eliminates any guarantee relating the Bellman error to the accuracy of the value function. We find that this observation extends to practical settings: when computed over an off-policy dataset, the Bellman error bears little relationship to the accuracy of the value function. Consequently, we show that the Bellman error is a poor metric for comparing value functions, and therefore an ineffective objective for off-policy evaluation. Finally, we discuss differences between the Bellman error and the non-stationary objective used by iterative methods and deep reinforcement learning, and highlight how the effectiveness of this objective relies on generalization during training.
One-sentence Summary: We show that minimizing the Bellman error does not result in better value prediction accuracy.
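To make the core claim concrete, the following is a minimal illustrative sketch (the notation here is assumed, not taken from the paper): it contrasts the value error of a candidate value function $Q$ against the empirical Bellman error computed over a finite off-policy dataset $\mathcal{D}$.

```latex
% Value error of a candidate Q against the true value function Q^\pi:
\[
  \mathrm{ValueError}(Q) \;=\; \big\| Q - Q^{\pi} \big\|,
  \qquad
  Q^{\pi}(s,a) \;=\; \mathbb{E}_{\pi}\!\Big[ \textstyle\sum_{t \ge 0} \gamma^{t} r_{t} \,\Big|\, s_0 = s,\ a_0 = a \Big].
\]
% Empirical (squared) Bellman error over a finite off-policy dataset D:
\[
  \mathrm{BE}_{\mathcal{D}}(Q) \;=\; \frac{1}{|\mathcal{D}|}
  \sum_{(s,a,r,s') \in \mathcal{D}}
  \Big( Q(s,a) - r - \gamma\, \mathbb{E}_{a' \sim \pi(\cdot \mid s')}\big[ Q(s',a') \big] \Big)^{2}.
\]
% Over all state-action pairs the Bellman equation pins down Q^\pi uniquely,
% but over a finite D any Q that matches the sampled backups attains
% BE_D(Q) = 0 while remaining unconstrained off-dataset, so a small Bellman
% error need not imply a small value error.
```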