Keywords: Deep RL, Value estimation, Experimental, MSE Bias-Variance Decomposition
TL;DR: We introduce EVarEst, a new value estimation objective that penalizes the MSE with the variance of the residual errors, aiming to mitigate the impact of policy non-stationarity in reinforcement learning.
Abstract: Modern reinforcement learning (RL) algorithms often rely on estimating the value of state-action pairs for a given policy, typically using neural networks to model this value. However, this estimation is hindered by intrinsic policy non-stationarity during training, which leads to increasing errors as the model's ability to adapt to new policies degrades over time. To address this issue, we introduce EVarEst, a novel objective function that enhances the standard mean squared error (MSE) by incorporating a weighted penalty on the variance of the residual errors. Unlike traditional penalties, this variance term reweights the bias-variance decomposition of the MSE without adding extra terms. EVarEst encourages the value network to generalize better across successive policies by promoting consistent prediction errors across a broader range of states and actions. EVarEst can be used in place of the value network objective in any RL algorithm with minimal modification to the algorithm and source code. Empirically, we show that the traditional MSE objective generally underperforms some version of EVarEst in terms of policy performance, illustrating the benefits of the flexibility offered by EVarEst. Additionally, we offer insights into how this new objective enhances performance, specifically by improving adaptability to policy non-stationarity.
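The abstract describes the objective as the MSE plus a weighted penalty on the variance of the residual errors, which amounts to reweighting the bias-variance decomposition of the MSE. Below is a minimal, hypothetical sketch of what such a loss could look like in PyTorch; the function name `evarest_loss` and the weight `lam` are illustrative assumptions, not the authors' implementation.

```python
import torch

def evarest_loss(q_pred: torch.Tensor, q_target: torch.Tensor, lam: float = 0.5) -> torch.Tensor:
    # Residual errors of the value network on a batch of state-action pairs.
    residuals = q_pred - q_target
    # Standard MSE; by the bias-variance decomposition, MSE = mean(e)^2 + Var(e).
    mse = residuals.pow(2).mean()
    # Weighted penalty on the variance of the residuals (assumed hyperparameter lam).
    var = residuals.var(unbiased=False)
    # Total loss = mean(e)^2 + (1 + lam) * Var(e): the same two terms as the MSE,
    # with the variance term reweighted rather than a new term added.
    return mse + lam * var
```

Because the penalty only rescales the variance component already present in the MSE, the loss can be dropped into an existing algorithm by replacing the value-network MSE call, consistent with the "minimal modification" claim in the abstract.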
Confirmation: I understand that authors of each paper submitted to EWRL may be asked to review 2-3 other submissions to EWRL.
Serve As Reviewer: ~Timothée_Mathieu1
Track: Regular Track: unpublished work
Submission Number: 166