Relative Value Learning

Published: 26 Jan 2026, Last Modified: 11 Apr 2026 · ICLR 2026 Poster · CC BY 4.0
Keywords: Relative Value Learning, On-Policy Actor-Critic, GAE, PPO
TL;DR: Learn a pairwise, antisymmetric value-difference critic and reconstruct GAE from differences, giving a drop-in PPO critic that avoids gauge issues and performs competitively on Atari.
Abstract: In reinforcement learning (RL), critics traditionally learn absolute state values, estimating how good a particular situation is in isolation. Since adding any constant to $V(s)$ leaves action preferences unchanged, only value differences are relevant for decision making. Motivated by this fact, we ask whether these differences can be learned directly. To this end, we propose \emph{Relative Value Learning} (RV), a framework built on antisymmetric value differences $\Delta(s_i, s_j) = V(s_i) - V(s_j)$. We define a new pairwise Bellman operator, prove it is a $\gamma$-contraction with a unique fixed point equal to the true value differences, derive well-posed $1$-step, $n$-step, and $\lambda$-return targets, and reconstruct generalized advantage estimation from pairwise differences to obtain an unbiased policy-gradient estimator (R-GAE). Beyond these theoretical contributions, we integrate RV with PPO and achieve competitive performance on the Atari benchmark (49 games, ALE) compared to standard PPO, indicating that relative value estimation is an effective alternative to absolute critics.
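The abstract does not spell out the pairwise operator itself, but for a fixed policy a natural candidate consistent with its fixed-point claim is $(\mathcal{T}\Delta)(s_i, s_j) = r(s_i) - r(s_j) + \gamma\, \mathbb{E}_{s_i', s_j'}[\Delta(s_i', s_j')]$. The tabular sketch below (my own illustration, not code from the paper) iterates this candidate operator on a random MDP and checks that it converges to the true value differences $V(s_i) - V(s_j)$ while preserving antisymmetry:

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 5, 0.9

# Random policy-induced transition matrix P and state rewards r
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)
r = rng.random(n)

# True values from the Bellman equation: V = r + gamma * P V
V = np.linalg.solve(np.eye(n) - gamma * P, r)

def pairwise_bellman(D):
    # (T D)[i, j] = (r[i] - r[j]) + gamma * E_{i'~P(i,.), j'~P(j,.)}[D[i', j']]
    return (r[:, None] - r[None, :]) + gamma * P @ D @ P.T

# Iterate from zero; a gamma-contraction converges geometrically
D = np.zeros((n, n))
for _ in range(200):
    D = pairwise_bellman(D)

true_diff = V[:, None] - V[None, :]
print(np.allclose(D, true_diff, atol=1e-6))  # fixed point = true value differences
print(np.allclose(D, -D.T))                  # antisymmetry is preserved
```

Both checks pass: plugging $D^*[i,j] = V(s_i) - V(s_j)$ into the operator reproduces itself term by term, since each side reduces to two independent Bellman backups whose constant parts cancel, which is exactly the gauge-freedom argument the abstract makes.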
Primary Area: reinforcement learning
Submission Number: 12688