Simplifying Actor-Critic Reinforcement Learning: Mitigating Overestimation Bias with a Single Distributional Critic

TMLR Paper 3867 Authors

07 Jan 2025 (modified: 25 Mar 2025) · Rejected by TMLR · CC BY 4.0
Abstract: Actor-critic methods in reinforcement learning learn the action-value function (critic) via temporal-difference learning and use it as an objective for policy improvement, which makes them more sample-efficient than on-policy methods. A well-known consequence, critic overestimation, is usually handled by pessimistic policy evaluation based on critic uncertainty, which may in turn lead to critic underestimation; pessimism is therefore a sensitive parameter that requires careful tuning. Current methods use the epistemic or predictive uncertainty of the critic for pessimistic learning, employing dropout or ensemble approaches. In this paper, we propose a novel actor-critic algorithm, called Stochastic Actor-Critic (STAC), that employs a distributional representation (for aleatoric uncertainty) and Bayesian dropout (for epistemic uncertainty) in both the critic and the actor to make the agent uncertainty-aware. Unlike previous methods, pessimistic updates are proportional only to the aleatoric uncertainty of the critic, not the epistemic uncertainty. This alone is enough to mitigate critic overestimation. Introducing Bayesian dropout further improves performance in some environments, although the resulting uncertainty is not used in the pessimistic objective. With an empirically determined pessimism level and dropout rate, a single distributional critic network is enough to achieve high sample efficiency. In addition, using a single critic with an update-to-data (UTD) ratio of 1 provides computation-efficient learning compared with other state-of-the-art methods.
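As a reading aid, the following is a minimal sketch, in plain PyTorch, of how a pessimism penalty proportional only to a single Normal-distributed critic's aleatoric uncertainty could enter a soft TD target. It is not the authors' implementation; `critic`, `beta`, and `alpha` are illustrative assumptions rather than the paper's API.

```python
# Minimal sketch (not the authors' released code): a pessimistic soft TD target
# built from a single distributional critic that predicts a Normal N(mu, sigma)
# over the action value. All names (critic, beta, alpha, ...) are assumptions.
import torch

def pessimistic_td_target(critic, reward, next_state, next_action, next_log_prob,
                          done, gamma=0.99, alpha=0.2, beta=1.0):
    """Soft (maximum-entropy) TD target penalized by aleatoric uncertainty only.

    `critic(next_state, next_action)` is assumed to return the mean and standard
    deviation of the critic's Normal return distribution.
    """
    with torch.no_grad():
        mu, sigma = critic(next_state, next_action)        # sigma: aleatoric uncertainty
        pessimistic_q = mu - beta * sigma                   # penalty scales with sigma only
        soft_value = pessimistic_q - alpha * next_log_prob  # SAC-style entropy term
        return reward + gamma * (1.0 - done) * soft_value
```

In this sketch `beta` plays the role of the pessimism parameter discussed in the abstract and scales only the critic's predicted (aleatoric) standard deviation; epistemic (dropout) uncertainty does not enter the target.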
Submission Length: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=N5txhr5EDL
Changes Since Last Submission:

## Second revision
- Uploaded the codebase with results.
- Explained in depth the use of epistemic uncertainty from dropout for exploration, with related references.
- Added a detailed discussion of the effect of pessimism in highly noisy environments, using results from `BipedalWalkerHardcore-v3`.
- Added a table of last-step mean and variance scores to compare methods.
- Only the best methods are now highlighted in the tables.
- Applied more temporal smoothing to the learning curves, with a square aspect ratio to fill pages.
- Improved the definitions in the Maximum Entropy RL section.
- Gave clearly distinct definitions of the Bellman backup on deterministic critics, the expected Bellman backup on stochastic critics, and the temporal-difference target (a generic sketch of this distinction is given after these notes).
- Added assumptions about the critic (finite-support property of $\mathcal{Q}(s,a)$) needed to apply Tonelli's theorem.
- Moved the main results into the text.
- Discussed the consequences of the Normal-distributed critic assumption.
- Clarified in Section 5.1 that STAC bootstraps one-step uncertainty by bootstrapping the mean of the next-state value.
- Renamed the Prior Art section to Related Work.
- Updated the abstract to emphasize that dropout is also used for exploration.
- Samples from $\mathcal{Q}$ were denoted as $q$; this is changed to $Q$.
- Explained the sparse reward and stochastic nature of the `BipedalWalkerHardcore-v3` environment; the results specific to this environment are discussed extensively.
- Discussed adaptive pessimism tuning by a bandit in Section 5.2 and explained why it is not used.
- Shortened some sentences in Section 3 and added a toy example demonstrating that dropout and distributional networks capture different types of uncertainty.

## Third revision
- Clarified that dropout is used only as a regularization/exploration heuristic.
- Clarified that the `Q` distribution is a bounded distribution, not a finite one.
- Noted in the conclusion that the algorithm should still be tested in highly stochastic environments.
- Shortened the epistemic uncertainty part, as it is not directly related to the main topic.

## Fourth revision
- In the Experiments section, mentioned that the `REDQ` algorithm with UTD ratio 1 was also tested but is not shown, as its results are very similar to `SAC`; this is explained in the text.
- Wording improvements in Section 5.
- Shortened the Epistemic Uncertainty section further, removing its equations; updated the figure to illustrate the difference between epistemic and aleatoric uncertainty.
- Explained under Section 5.1 that the Normal distribution for the critic is used only for practical implementation.
- Explained, under Section 5.2 Pessimistic Objective, the reason for using pessimistic objectives for both policy improvement (policy update) and policy evaluation (critic update).
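For readers skimming this changelog, the following is a generic sketch of the deterministic versus expected (stochastic-critic) Bellman backup referred to above, written in standard notation; the paper's exact definitions may differ.

```latex
% Generic sketch in standard notation (assumed, not copied from the paper):
% the deterministic-critic backup bootstraps Q(s',a') directly, while the
% expected backup for a stochastic critic bootstraps the mean of Q' ~ Q(s',a').
\begin{align}
  (\mathcal{T}^{\pi} Q)(s,a) &= r(s,a)
    + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a),\, a' \sim \pi(\cdot \mid s')}
      \big[ Q(s',a') \big], \\
  (\mathcal{T}^{\pi} \bar{Q})(s,a) &= r(s,a)
    + \gamma\, \mathbb{E}_{s',a'}\Big[ \mathbb{E}_{Q' \sim \mathcal{Q}(s',a')}\big[ Q' \big] \Big].
\end{align}
```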
Assigned Action Editor: ~Marc_Lanctot1
Submission Number: 3867
