Beyond Marginals: Capturing Dependent Returns through Joint Moments in Distributional Reinforcement Learning

20 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: reinforcement learning, distributional reinforcement learning, interpretability, safe reinforcement learning
Abstract: Distributional reinforcement learning (DRL) has emerged as a paradigm that aims to learn the full distribution of returns under a policy, rather than only their expected values. Existing DRL algorithms learn the return distribution of each action at a state independently. However, we establish that in many environments the returns of different actions at the same state are statistically dependent due to shared transition and reward structure, and that learning only per-action marginals discards information that is exploitable for secondary objectives. We formalize a joint Markov decision process (MDP) view that lifts an MDP into a partially observable MDP whose hidden states encode coupled potential outcomes across actions, and we derive joint distributional Bellman equations together with a joint iterative policy evaluation (JIPE) scheme with convergence guarantees. We further introduce a deep learning method, with optimality and convergence guarantees, that represents joint return distributions with Gaussian mixture models. Empirically, we first validate the JIPE scheme on MDPs with known correlation structure, and then illustrate the learned joint structure in control and Arcade Learning Environment tasks using neural networks. Together, these results demonstrate that modeling return dependencies yields accurate joint moments and joint distributions, which aid interpretability and can be used to derive safe and cost-efficient policies.
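As a minimal illustration of the kind of quantity the abstract refers to, the sketch below (not the paper's implementation; the mixture parameters are assumed inputs, e.g. from a network head for one state) computes the joint mean and cross-action covariance of per-action returns represented by a Gaussian mixture, from which a return correlation between actions follows.

```python
# Illustrative sketch: joint moments of per-action returns under a Gaussian
# mixture representation. Inputs are hypothetical mixture parameters for one
# state; this is not the authors' training or evaluation code.
import numpy as np

def gmm_joint_moments(weights, means, covs):
    """Mean vector and covariance matrix of a Gaussian mixture.

    weights: (K,)      mixture weights summing to 1
    means:   (K, A)    per-component mean return for each of A actions
    covs:    (K, A, A) per-component covariance of returns across actions
    """
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    covs = np.asarray(covs, dtype=float)

    # First joint moment: weighted average of component means.
    mean = np.einsum("k,ka->a", weights, means)

    # Second joint moment via the law of total covariance:
    # Cov[Z] = E[Cov[Z | component]] + Cov(E[Z | component]).
    within = np.einsum("k,kab->ab", weights, covs)
    diffs = means - mean
    between = np.einsum("k,ka,kb->ab", weights, diffs, diffs)
    return mean, within + between

# Hypothetical two-component mixture over two actions whose returns are
# positively dependent (e.g. due to shared transition/reward structure).
w = np.array([0.6, 0.4])
mu = np.array([[1.0, 0.8], [3.0, 2.5]])
sigma = np.array([[[0.20, 0.10], [0.10, 0.20]],
                  [[0.30, 0.15], [0.15, 0.30]]])
m, C = gmm_joint_moments(w, mu, sigma)
corr = C[0, 1] / np.sqrt(C[0, 0] * C[1, 1])  # cross-action return correlation
print(m, corr)
```

Such cross-action moments are exactly what per-action marginal representations cannot provide, since they carry no information about the dependence between actions' returns.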
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 22352