Reducing Overestimation by Measuring Critic Disagreement in Multi-Critic Architectures

Published: 19 Dec 2025, Last Modified: 05 Jan 2026
Venue: AAMAS 2026 Full
License: CC BY 4.0
Keywords: Overestimation bias, Ensemble
Abstract: We introduce Ensemble Std (ES), a lightweight regularizer for multi-critic deep reinforcement learning that mitigates overestimation by calibrating target values according to critic disagreement. ES treats the dispersion of target-value estimates across an ensemble of Q-functions as an uncertainty signal, applying an adaptive subtractive penalty when disagreement is high. This yields uncertainty-aware, conservative targets that preserve the learning signal where critics agree and temper optimism where they do not, complementing minimum-based targets without imposing uniformly pessimistic updates. ES operates directly on critic targets, requiring no architectural changes, and integrates seamlessly into a wide range of actor–critic algorithms. In practice, ES plugs easily into TD3, SAC, and TD3+BC with negligible overhead, consistently improving stability and returns while reducing variance. Overall, ES offers a simple, conceptually transparent mechanism that turns ensemble disagreement into principled value regularization, making multi-critic learners more robust in noisy, uncertain, and data-limited regimes. Our code is available at \url{https://github.com/anonymouszxcv16/ES}.
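The abstract's target calibration can be sketched in a few lines. This is a minimal illustration, not the paper's exact formulation: the combination rule (ensemble mean minus a scaled standard deviation) and the penalty coefficient `beta` are assumptions here, and the paper may pair the penalty with minimum-based targets differently.

```python
import numpy as np

def es_target(q_targets: np.ndarray, beta: float = 0.5) -> np.ndarray:
    """Uncertainty-aware TD target from an ensemble of critics.

    q_targets: shape (num_critics, batch_size), each row is one critic's
        bootstrapped target estimate for the batch.
    beta: penalty coefficient (hypothetical name/value for illustration).

    Returns a conservative target: the ensemble mean, reduced by a
    subtractive penalty proportional to critic disagreement (std),
    so targets stay close to the mean where critics agree and are
    tempered where they diverge.
    """
    mean = q_targets.mean(axis=0)
    disagreement = q_targets.std(axis=0)  # dispersion across the ensemble
    return mean - beta * disagreement

# Example: two critics, two transitions. Where the critics agree
# (first column), the target is just their mean; where they disagree
# (second column), the target is pulled below the mean.
q = np.array([[1.0, 2.0],
              [1.0, 4.0]])
print(es_target(q, beta=0.5))  # → [1.  2.5]
```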
Area: Learning and Adaptation (LEARN)
Generative AI: I acknowledge that I have read and will follow this policy.
Submission Number: 732