Beyond Marginals: Capturing Correlated Returns through Joint Distributional Reinforcement Learning

Published: 23 Sept 2025, Last Modified: 01 Dec 2025 · ARLET · CC BY 4.0
Track: Research Track
Keywords: reinforcement learning, distributional reinforcement learning
Abstract: Distributional reinforcement learning (DRL) has emerged in recent years as a powerful paradigm that learns the full distribution of the return from each state-action pair under a policy, rather than only its expected value. Existing DRL algorithms learn the return distribution independently for each action at a state. However, we establish that in many environments the returns for different actions at the same state are statistically dependent due to shared transition and reward structure, and that learning only per-action marginals discards potentially exploitable information. We formalize a joint MDP view that lifts an MDP into a POMDP whose hidden states encode coupled potential outcomes across actions, and we derive joint distributional Bellman equations together with a joint iterative policy evaluation (JIPE) scheme with convergence guarantees. On the algorithmic side, we introduce a deep learning method that represents joint returns with homoscedastic Gaussian mixture models and trains them by matching a multivariate temporal-difference (TD) target. Empirically, we validate the proposed framework on two custom MDPs with known correlation structure (a bandit with shared randomness in rewards, and a windy gridworld environment), and illustrate the learned joint structure on the classic control task CartPole and the Arcade Learning Environment game Pong. Together, these results demonstrate that modeling cross-action return dependence yields accurate joint moments and informative joint distributions that can support safer, more sample-efficient control.
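
The sketches below are illustrative only and are not taken from the paper; the notation and implementation details are assumptions based on the abstract.

A plausible reading of the joint distributional Bellman view: instead of one fixed-point equation per action, the return vector across all actions at a state is treated as a single random vector whose coordinates share the environment's randomness,

$$
\big(Z^\pi(s,a)\big)_{a\in\mathcal{A}} \;\stackrel{D}{=}\; \big(R(s,a) + \gamma\, Z^\pi(S'_a, A'_a)\big)_{a\in\mathcal{A}},
\qquad S'_a \sim P(\cdot\mid s,a),\; A'_a \sim \pi(\cdot\mid S'_a),
$$

where the coupling of the rewards R(s,a) and next states S'_a across actions is exactly the structure that per-action marginals discard.

A minimal code sketch of the algorithmic idea described in the abstract (a homoscedastic Gaussian-mixture head over the joint return vector, fit to a multivariate TD target by maximum likelihood); all class and parameter names here are hypothetical, not the authors' implementation:

```python
import torch
import torch.nn as nn


class JointReturnGMM(nn.Module):
    """Homoscedastic Gaussian mixture over the joint return vector Z(s, ·) in R^{|A|}."""

    def __init__(self, state_dim: int, num_actions: int, num_components: int = 5):
        super().__init__()
        self.num_components = num_components
        self.num_actions = num_actions
        self.trunk = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU())
        # Mixture weights (logits) and per-component mean vectors over actions.
        self.logit_head = nn.Linear(128, num_components)
        self.mean_head = nn.Linear(128, num_components * num_actions)
        # Homoscedastic: one diagonal covariance shared across components and states.
        self.log_std = nn.Parameter(torch.zeros(num_actions))

    def distribution(self, state: torch.Tensor) -> torch.distributions.MixtureSameFamily:
        h = self.trunk(state)
        logits = self.logit_head(h)                                     # (B, K)
        means = self.mean_head(h).view(-1, self.num_components,
                                       self.num_actions)                # (B, K, |A|)
        components = torch.distributions.Independent(
            torch.distributions.Normal(means, self.log_std.exp()), 1)   # event dim |A|
        return torch.distributions.MixtureSameFamily(
            torch.distributions.Categorical(logits=logits), components)

    def loss(self, state: torch.Tensor, td_target: torch.Tensor) -> torch.Tensor:
        """Negative log-likelihood of a multivariate TD target of shape (B, |A|)."""
        return -self.distribution(state).log_prob(td_target).mean()


if __name__ == "__main__":
    model = JointReturnGMM(state_dim=4, num_actions=2)
    states = torch.randn(32, 4)
    targets = torch.randn(32, 2)   # stand-in for r + gamma * z' from a target network
    print(model.loss(states, targets).item())
```

In this sketch the multivariate TD target would be formed, for example, as the reward vector plus gamma times a sample of the joint return at the next state drawn from a target network; the abstract does not specify the exact target construction or training details.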
Submission Number: 60