Abstract: Actor-critic methods in reinforcement learning learn an action-value function (critic) by temporal difference learning and use it as an objective for policy improvement, making them more sample efficient than on-policy methods. The well-known problem of critic overestimation is usually handled by pessimistic policy evaluation based on critic uncertainty, which may in turn lead to critic underestimation. Pessimism is therefore a sensitive parameter that requires careful tuning. Most methods employ an ensemble to represent the uncertainty of critic estimates, but this comes at a computational cost. To mitigate the sample and computation inefficiency of the actor-critic approach, we propose a novel and simple algorithm, called Deep Bayesian Actor-Critic (DBAC), that employs Bayesian dropout and a heteroscedastic critic network instead of an ensemble to make the agent uncertainty-aware. To mitigate the overestimation bias of the critic, pessimistic policy evaluation is performed, with pessimism proportional to the uncertainty of the predictions. Using dropout together with a distributional representation of the critic leads to more computation-efficient calculations. With empirically tuned pessimism and dropout regularization, a single critic network is enough to achieve high sample and computation efficiency with near-SOTA performance.
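For illustration, the following is a minimal sketch (not the authors' implementation) of how a pessimistic TD target could be formed from MC-dropout samples of a heteroscedastic critic, as described in the abstract. The names `critic`, `target_policy`, `beta`, and `n_dropout_samples` are hypothetical, the critic is assumed to output a Gaussian (mean, std) pair, and the maximum entropy (SAC-style) entropy bonus is omitted for brevity.

```python
import torch

def pessimistic_td_target(critic, target_policy, reward, next_obs, done,
                          gamma=0.99, beta=1.0, n_dropout_samples=10):
    """Illustrative pessimistic TD target using MC dropout over a
    heteroscedastic critic that outputs a (mean, std) pair.
    All names here are assumptions for the sketch, not from the paper."""
    critic.train()  # keep dropout active at evaluation time (MC dropout)
    with torch.no_grad():
        next_action = target_policy(next_obs)
        means, stds = [], []
        for _ in range(n_dropout_samples):
            mu, sigma = critic(next_obs, next_action)  # heteroscedastic head
            means.append(mu)
            stds.append(sigma)
        means = torch.stack(means)  # shape: (samples, batch, 1)
        stds = torch.stack(stds)
        # Epistemic uncertainty: spread of the dropout means.
        epistemic_var = means.var(dim=0)
        # Aleatoric uncertainty: average predicted variance.
        aleatoric_var = (stds ** 2).mean(dim=0)
        uncertainty = (epistemic_var + aleatoric_var).sqrt()
        # Pessimism proportional to the uncertainty of the predictions.
        q_pessimistic = means.mean(dim=0) - beta * uncertainty
        target = reward + gamma * (1.0 - done) * q_pessimistic
    return target
```

In this sketch a single dropout-regularized critic provides both uncertainty components, which is the computational advantage over an ensemble that the abstract refers to.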
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: Grammatical errors were fixed and some parts of the paper were updated after the reviews; we tried to answer the reviewers' questions as thoroughly as possible. Other than typo and grammar corrections, the changes are as follows:
- The TOP paper (Moskovitz et al.) is now cited in Section 4.
- A more detailed explanation is given of why zero dropout is used for HalfCheetah-v4 and Walker2d-v4.
- Additions were made to the future work section regarding pessimism/dropout sensitivity, uncertainty calibration, etc.
- We implemented a SAC variant of the TOP algorithm (for a fair comparison with other maximum entropy methods), ran it on 6 MuJoCo environments, and added the respective reward curves.
- Value error curves were added to the ablation studies to give more insight into pessimism and dropout.
- We explained why aleatoric uncertainty modeling is necessary even in deterministic environments.
- A third ablation study was conducted on two stochastic environments, comparing DBAC and TQC.
- Since we had previously conducted a target entropy ablation, we added it as the fourth ablation study.
- Network architectures are visualized in the Appendix.
Second update:
- The pessimism ablation is extended to 5 parameter values for each environment.
- Minor grammar improvements and fixes.
Third update:
- We realized that Theorem 4.1 (in the text) was missing a reward term; this is fixed.
- In Section 1.2 (learning setting), it is now stated that DBAC employs an experience buffer (this had been omitted before).
Assigned Action Editor: ~Marc_Lanctot1
Submission Number: 3100