Can K Heads Explore Better Than One in Online Reinforcement Learning?

Abhishek Jha; Satyapragnya Kar; Kishlay Kumar; Stephanie Milani; Rajesh Ranganath

Published: 25 May 2026, Last Modified: 27 May 2026 · DEMO 2026 Poster · CC BY 4.0

Keywords: ensemble exploration, uncertainty quantification, generative policies, online adaptation

TL;DR: ESAC runs an ensemble of diffusion/flow policies with independent critics and selects actions via mean return plus a calibrated trajectory disagreement bonus, recovering principled exploration that shared-Q ensembles provably lose.

Abstract: Ensemble methods provide principled exploration for Gaussian policies in reinforcement learning, but their extension to the increasingly popular family of generative policies based on diffusion and flow matching is non-trivial. The standard ensemble template, namely $K$ value heads trained on bootstrapped data subsets and selected by exploration strategies, fails for Q-weighted generative losses. When the $K$ policy heads share a common Q-network, the Q-weighted importance weights are identical across heads by construction, so the denoising loss produces head-invariant gradients in expectation and the heads collapse to a single function. Per-sample bootstrap masking cannot prevent this collapse because the mask only modulates a signal that is already head-invariant before it enters the loss. This failure mode is specific to importance-weighted generative losses and does not arise for standard Gaussian policy ensembles. To resolve this, we introduce Ensemble Soft Actor-Critic (ESAC), an ensemble framework that pairs each generative head with its own independent Q-network, so that per-head Q-weighted objectives drive the heads toward distinct optima. Within the same architecture, we propose Trajectory Integrated Disagreement Exploration (TIDE), an action-selection rule that scores candidates by ensemble disagreement integrated across the full denoising trajectory and adaptively calibrates the bonus by the running correlation between velocity disagreement and Q-uncertainty. We formalize the collapse failure mode and the diversity restoration mechanism as theorems. On MuJoCo locomotion benchmarks with both diffusion and conditional flow matching backbones, ESAC delivers consistent multi-seed gains over its soft diffusion actor-critic backbone, with the largest gains on tasks where directed exploration matters most.
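The collapse argument in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: it assumes a softmax-over-batch form for the Q-weighted importance weights and replaces the denoising loss with a weighted least-squares fit, and all names (`q_weights`, `weighted_optimum`, `head_spread`) and sizes are hypothetical. Under those assumptions, heads that share one Q-function minimize the same weighted objective and land on the same optimum, while heads with their own Q-values land on distinct optima.

```python
import numpy as np

rng = np.random.default_rng(0)
K, B, D = 4, 64, 3  # heads, batch size, feature dimension (hypothetical sizes)

# Toy stand-in for the denoising loss (an assumption, not the paper's loss):
# each head minimizes sum_i w_i * (x_i @ theta - y_i)^2 over the same batch.
X = rng.normal(size=(B, D))
y = rng.normal(size=B)

def q_weights(q, temperature=1.0):
    """Assumed weight form: softmax of Q-values over the batch."""
    z = (q - q.max()) / temperature
    w = np.exp(z)
    return w / w.sum()

def weighted_optimum(w):
    """Closed-form minimizer of the weighted least-squares objective."""
    sw = np.sqrt(w)
    theta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return theta

def head_spread(thetas):
    """Largest parameter gap between any head and head 0."""
    return float(np.abs(thetas - thetas[0]).max())

# Shared critic: every head receives identical weights, hence an identical optimum.
shared_q = rng.normal(size=B)
shared_heads = np.stack(
    [weighted_optimum(q_weights(shared_q)) for _ in range(K)]
)

# Independent critics: each head's Q-values differ, so the weighted objectives differ.
per_head_q = shared_q + 0.5 * rng.normal(size=(K, B))
independent_heads = np.stack(
    [weighted_optimum(q_weights(q)) for q in per_head_q]
)

print("spread with shared Q:     ", head_spread(shared_heads))
print("spread with independent Q:", head_spread(independent_heads))
```

In this toy setting the shared-Q spread is zero because the objectives are literally the same, whereas the abstract's claim for the real method is about gradients being head-invariant in expectation; the sketch only conveys the intuition for why independent critics restore diversity.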

Submission Number: 162
