Keywords: behavior-regularized RL, offline RL, RLHF, optimal transport
TL;DR: A new approach to behavior-regularized RL without explicit regularization.
Abstract: We study behavior-regularized reinforcement learning (RL), which encompasses offline RL and RL from human feedback (RLHF). In both settings, regularization toward a reference distribution (the offline data in offline RL or the supervised fine-tuned policy in RLHF) is essential to prevent value over-optimization caused by erroneous out-of-distribution extrapolation. Existing methods typically add distance or divergence penalties to the learning objective, which introduces optimization challenges and over-conservatism. In this paper, we propose Value Gradient Flow (VGF), a new paradigm for behavior-regularized RL. VGF formulates an optimal transport problem from the reference distribution to the optimal policy distribution induced by the value function. This problem is solved via discrete gradient flow, where value gradients guide particles sampled from the reference distribution. Our theoretical analysis shows that implicit behavior regularization is imposed by controlling the transport budget. This formulation avoids unnecessary restrictions on the optimization problem, enabling better reward maximization. Moreover, VGF operates without explicit policy parameterization while remaining expressive and flexible, allowing adaptive test-time scaling by adjusting the transport budget. Extensive experiments demonstrate that VGF significantly outperforms prior methods, achieving state-of-the-art results on offline RL benchmarks (D4RL, OGBench) and challenging RLHF tasks.
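To make the mechanism in the abstract concrete, here is a minimal sketch (not the authors' code) of the particle update it describes: actions sampled from a reference policy are moved along value gradients for a fixed number of steps, so the step count and step size play the role of the transport budget that implicitly regularizes toward the reference. The names `value_net`, `ref_policy`, `num_steps`, and `step_size` are illustrative assumptions.

```python
import torch

def vgf_actions(value_net, ref_policy, states, num_steps=10, step_size=0.05):
    """Guide reference-policy samples toward higher value via gradient ascent (sketch)."""
    actions = ref_policy.sample(states).detach()   # particles drawn from the reference distribution
    for _ in range(num_steps):                     # number of steps acts as the transport budget
        actions.requires_grad_(True)
        value = value_net(states, actions).sum()
        (grad,) = torch.autograd.grad(value, actions)
        actions = (actions + step_size * grad).detach()  # move particles toward higher value
    return actions
```

A smaller budget keeps the returned actions close to the reference samples (stronger implicit regularization), while a larger budget permits more aggressive value maximization, which is the test-time scaling knob the abstract mentions.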
Primary Area: reinforcement learning
Submission Number: 691