BiVWAC: Improving deep reinforcement learning algorithms using Bias-Variance Weighted Actor-Critic

Yann Berthelot; Timothée Mathieu; Riad Akrour; Philippe Preux

26 Sept 2024 (modified: 21 Nov 2024), ICLR 2025 Conference Withdrawn Submission, CC BY 4.0

Keywords: Reinforcement Learning, Bias, Variance, Actor-Critic, Deep Reinforcement Learning, SAC, PPO, AVEC, Mujoco

TL;DR: We study how weighting bias against variance in the critic loss can improve actor-critic performance.

Abstract: We introduce Bias-Variance Weighted Actor-Critic (BiVWAC), a modification scheme for actor-critic algorithms that allows control over the bias-variance weighting in the critic. In actor-critic algorithms, the critic loss is the Mean Squared Error (MSE), which can be decomposed into a bias term and a variance term. Based on this decomposition, BiVWAC constructs a new critic loss in which a hyperparameter $\alpha$ weighs bias against variance. The MSE and Actor with Variance Estimated Critic (AVEC, which considers only the variance term of the MSE decomposition) are special cases of this weighting, for $\alpha=0.5$ and $\alpha=0$ respectively. We demonstrate the theoretical consistency of the new critic loss and measure its performance on a set of tasks. We also study the value estimation and gradient estimation capabilities of BiVWAC to understand how it affects performance. We show experimentally that the MSE is suboptimal as a critic loss when compared to other $\alpha$ values. We equip SAC and PPO with the BiVWAC loss to obtain BiVWAC-SAC and BiVWAC-PPO, and we propose a safe $\alpha$ value, $\alpha^*$, for which the policy performance of BiVWAC-SAC is better than or equal to that of SAC in all studied tasks but one. BiVWAC introduces minimal changes to the algorithms and virtually no additional computational cost. We also present a method to compare, in a sound manner, the impact of critic modifications between algorithms.
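To make the weighting concrete, here is a minimal sketch of an $\alpha$-weighted critic loss. It is not taken from the paper: the function names and the factor of 2 are assumptions, chosen only so that $\alpha=0.5$ recovers the MSE and $\alpha=0$ leaves a variance-only loss, as the abstract describes. It relies on the batch identity that the mean squared residual equals the squared mean residual (bias) plus the mean squared deviation from that mean (variance).

```python
from statistics import fmean


def bias_variance_weighted_loss(predictions, targets, alpha):
    """Hypothetical alpha-weighted critic loss; the paper's exact form may differ."""
    # Residuals between critic predictions and their regression targets.
    errors = [p - t for p, t in zip(predictions, targets)]
    mean_error = fmean(errors)
    # Squared bias: square of the mean residual over the batch.
    bias_sq = mean_error ** 2
    # Variance: mean squared deviation of the residuals from their mean.
    variance = fmean([(e - mean_error) ** 2 for e in errors])
    # The factor 2 is an assumed normalization so that alpha = 0.5 equals the MSE.
    return 2.0 * (alpha * bias_sq + (1.0 - alpha) * variance)


def mse(predictions, targets):
    """Plain mean squared error, for comparison."""
    return fmean([(p - t) ** 2 for p, t in zip(predictions, targets)])
```

Under these assumptions, `bias_variance_weighted_loss(p, t, 0.5)` matches `mse(p, t)` up to floating-point error, and `alpha=0` returns twice the residual variance, which corresponds to a variance-only (AVEC-style) objective up to a constant factor.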

Primary Area: reinforcement learning

Submission Number: 7073
