Keywords: reinforcement learning, parallel simulation, value function, ppo, policy gradients, policy optimization
TL;DR: REPPO uses straight-through gradient estimation via a surrogate Q function to obtain more accurate policy gradients.
Abstract: Score-function based methods for policy learning, such as REINFORCE and PPO, have delivered strong results in game-playing and robotics, yet their high variance often undermines training stability. Improving a policy through state-action value functions, for example by differentiating Q with respect to the policy, alleviates these variance issues. However, this requires an accurate action-conditioned value function, which is notoriously hard to learn without relying on replay buffers to reuse past off-policy data. We present Relative Entropy Pathwise Policy Optimization (REPPO), an algorithm that trains Q-value models purely from on-policy trajectories, unlocking the use of Q-function derivatives to compute policy updates in the on-policy setting. We show how to combine stochastic policies for exploration with constrained updates for stable training, and we evaluate the architectural components that stabilize value function learning. The result is an efficient on-policy algorithm that combines the stability of Q-based policy gradients with the simplicity and minimal memory footprint of standard on-policy learning. Compared to state-of-the-art methods on two standard GPU-parallelized benchmarks, REPPO delivers strong empirical performance with superior sample efficiency, wall-clock time, memory footprint, and hyperparameter robustness.
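To make the core idea concrete, the sketch below illustrates the pathwise (reparameterized) policy-gradient principle the abstract describes: actions are sampled as a differentiable function of the policy parameters, and the gradient of a surrogate Q function is propagated through the action into the policy. This is only a toy illustration, not the REPPO algorithm itself; the quadratic `q_value`, the fixed target, and all constants are hypothetical stand-ins for a learned on-policy Q model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical surrogate Q: a simple quadratic with maximum at action = 1.5.
# In REPPO this would be a Q model trained from on-policy trajectories.
def q_grad(action, target=1.5):
    # Analytic dQ/da for Q(a) = -(a - target)^2; autodiff would supply this.
    return -2.0 * (action - target)

mu, sigma = 0.0, 0.2  # Gaussian policy: mean is learned, std held fixed here
lr = 0.1

for _ in range(200):
    eps = rng.standard_normal(64)
    actions = mu + sigma * eps        # reparameterized sampling: a = mu + sigma*eps
    # Pathwise gradient: dQ/dmu = dQ/da * da/dmu, and da/dmu = 1,
    # so the batch mean of dQ/da estimates the policy gradient directly.
    grad_mu = q_grad(actions).mean()
    mu += lr * grad_mu                # gradient ascent on E[Q(pi(s))]

print(mu)  # converges near the Q maximum at 1.5
```

Because the gradient flows through the action rather than through log-probabilities, each sample contributes a low-variance gradient estimate, which is the advantage over score-function estimators that the abstract emphasizes.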
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 3112