Wasserstein Policy Optimization

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We derive a novel policy gradient algorithm from Wasserstein gradient flows and show that it is simple and effective at deep reinforcement learning for continuous control.
Abstract: We introduce Wasserstein Policy Optimization (WPO), an actor-critic algorithm for reinforcement learning in continuous action spaces. WPO can be derived as an approximation to Wasserstein gradient flow over the space of all policies projected into a finite-dimensional parameter space (e.g., the weights of a neural network), leading to a simple and completely general closed-form update. The resulting algorithm combines many properties of deterministic and classic policy gradient methods. Like deterministic policy gradients, it exploits knowledge of the *gradient* of the action-value function with respect to the action. Like classic policy gradients, it can be applied to stochastic policies with arbitrary distributions over actions -- without using the reparameterization trick. We show results on the DeepMind Control Suite and a magnetic confinement fusion task which compare favorably with state-of-the-art continuous control methods.
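Below is a minimal JAX sketch of the kind of update the abstract describes: an actor-critic step that uses the critic's action-gradient with a stochastic policy and no reparameterization trick. It is not the authors' implementation (see the linked Acme script for that); the update form E[∇_θ(∇_a log π_θ(a|s) · ∇_a Q(s, a))] is assumed purely for illustration, and the policy, critic, and shapes are hypothetical placeholders.

```python
# Minimal sketch only, not the authors' implementation (see the linked Acme
# script for that). It illustrates an actor-critic step that uses the critic's
# action-gradient with a stochastic policy and no reparameterization trick.
# Assumed update form: grad_theta[ grad_a log pi(a|s) . grad_a Q(s, a) ];
# the policy, critic, and shapes are hypothetical placeholders.
import jax
import jax.numpy as jnp

def log_pi(theta, s, a):
    # Hypothetical diagonal-Gaussian policy: a linear map from state to mean.
    w_mu, b_mu, log_std = theta
    mu = s @ w_mu + b_mu
    return jnp.sum(-0.5 * ((a - mu) / jnp.exp(log_std)) ** 2 - log_std)

def q_fn(phi, s, a):
    # Hypothetical critic standing in for a learned action-value function Q(s, a).
    x = jnp.concatenate([s, a])
    return -jnp.sum((x @ phi) ** 2)

def wpo_style_update(theta, phi, s, a):
    # grad_a Q(s, a): the action-gradient the abstract says the method exploits.
    dq_da = jax.grad(q_fn, argnums=2)(phi, s, a)

    def inner(theta):
        # grad_a log pi(a|s), contracted with the critic's action-gradient
        # (treated as a constant with respect to the policy parameters).
        da_logp = jax.grad(log_pi, argnums=2)(theta, s, a)
        return jnp.dot(da_logp, dq_da)

    # Differentiate the contraction w.r.t. theta to get an update direction.
    return jax.grad(inner)(theta)

# Example usage with hypothetical shapes: 4-dim state, 2-dim action.
s, a = jnp.ones(4), 0.5 * jnp.ones(2)
theta = (jnp.zeros((4, 2)), jnp.zeros(2), jnp.zeros(2))
phi = jax.random.normal(jax.random.PRNGKey(0), (6, 3))
update_direction = wpo_style_update(theta, phi, s, a)  # pytree matching theta
```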
Lay Summary: One of the most important problems in AI is how to learn to make sequences of good decisions (or 'actions') to optimize a desired objective. Many important applications require continuous actions. Such actions consist of smooth parameters (rather than, say, discrete symbols), each of which needs to be set appropriately and adaptively, and often quite precisely, to get the desired behaviour. Consider, for example, controlling robotic limbs or using magnets to control plasma in a nuclear fusion reactor. We present a new reinforcement learning algorithm that can learn directly from experience in such settings and works especially well in "high-dimensional" problems with many continuous action components. This makes it a promising algorithm for applications such as robotic control or nuclear fusion. However, the algorithm is not specific to these tasks, and we show it is effective on several continuous control problems. The new algorithm is derived from a mathematical concept called a "Wasserstein gradient flow". This concept has inspired earlier algorithms, but a practical and efficient algorithm had not previously been derived from it directly. The resulting algorithm is intuitive and elegant, and we discuss how it relates to, and improves upon, previous algorithms.
Link To Code: https://github.com/google-deepmind/acme/blob/master/examples/baselines/rl_continuous/run_wpo.py
Primary Area: Reinforcement Learning
Keywords: Policy Optimization, Wasserstein metric, Optimal Transport, Gradient Flow, Deep Reinforcement Learning, Actor-Critic, Continuous Control
Submission Number: 13462