Keywords: Distributional Reinforcement Learning, Value Gradients, Sobolev Training, Stochastic Environments, MuJoCo Benchmarks, Noisy Dynamics
TL;DR: We introduce Distributional Sobolev Training, which models distributions over values and their gradients via a Sobolev temporal-difference operator; we prove the operator is a contraction and show improved RL under stochastic dynamics.
Abstract: Gradient-regularized value learning methods improve sample efficiency by leveraging learned models of transition dynamics and rewards to estimate return gradients. However, existing approaches, such as MAGE, struggle in stochastic or noisy environments, limiting their applicability. In this work, we address these limitations by extending distributional reinforcement learning over continuous state-action spaces to model not only the distribution over the scalar state-action value function but also the distribution over its gradient. We refer to this approach as Distributional Sobolev Training. Inspired by Stochastic Value Gradients (SVG), our method uses a one-step world model of the reward and transition distributions implemented as a conditional Variational Autoencoder (cVAE). The proposed framework is sample-based and employs the Max-Sliced Maximum Mean Discrepancy (MSMMD) to instantiate the distributional Bellman operator. We prove that the Sobolev-augmented Bellman operator is a contraction with a unique fixed point, and highlight a fundamental smoothness trade-off underlying contraction in gradient-aware RL. To validate our method, we first showcase its effectiveness on a simple stochastic reinforcement learning toy problem, then benchmark its performance on several MuJoCo environments.
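To make the sample-based loss concrete, below is a minimal sketch (not the authors' implementation) of the Max-Sliced MMD between two sample sets, which is one plausible way to instantiate the distributional matching described in the abstract; all function names and hyperparameters here are illustrative assumptions.

```python
# Minimal sketch: max-sliced MMD between samples X ~ P and Y ~ Q.
# Assumed usage: X are samples from the predicted (value, gradient)
# distribution, Y are samples from the bootstrapped Sobolev TD target.
import torch


def rbf_mmd2(x, y, bandwidth=1.0):
    """Squared MMD with a Gaussian kernel between 1-D sample batches."""
    def k(a, b):
        d2 = (a[:, None] - b[None, :]) ** 2
        return torch.exp(-d2 / (2 * bandwidth ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()


def max_sliced_mmd(X, Y, n_steps=50, lr=0.1, bandwidth=1.0):
    """Maximize the MMD of 1-D projections over unit directions (max-sliced MMD)."""
    d = X.shape[1]
    theta = torch.randn(d, requires_grad=True)
    opt = torch.optim.Adam([theta], lr=lr)
    for _ in range(n_steps):
        u = theta / theta.norm()  # project the direction onto the unit sphere
        # Ascend on the 1-D MMD of the projected samples; detach X, Y so only
        # the slicing direction is optimized in this inner loop.
        loss = -rbf_mmd2(X.detach() @ u, Y.detach() @ u, bandwidth)
        opt.zero_grad()
        loss.backward()
        opt.step()
    u = (theta / theta.norm()).detach()
    # Recompute with the optimized direction; this value is differentiable
    # w.r.t. X and Y, so it can serve as a critic loss.
    return rbf_mmd2(X @ u, Y @ u, bandwidth)
```

In this sketch the discrepancy would be minimized with respect to the critic's parameters through `X`, while the inner loop searches for the most discriminative one-dimensional slice; the paper's actual operator, kernels, and optimization schedule may differ.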
Primary Area: reinforcement learning
Submission Number: 13919