Keywords: Expectile, Robust RL, Robust MDPs
TL;DR: To introduce a form of pessimism, we propose to replace the expectation over next states in the Bellman operator with an expectile. In practice, this can be done very simply by replacing the $L_2$ loss with a more general expectile loss for the critic.
Abstract: Many classic Reinforcement Learning (RL) algorithms rely on a Bellman operator, which involves an expectation over the next states, leading to the concept of bootstrapping. To introduce a form of pessimism, we propose to replace this expectation with an expectile. In practice, this can be done very simply by replacing the $L_2$ loss with a more general expectile loss for the critic. Introducing pessimism in RL is desirable for various reasons, such as tackling the overestimation problem (for which classic solutions are double Q-learning or the twin-critic approach of TD3) or robust RL (where transitions are adversarial). We study both cases empirically. For the overestimation problem, we show that the proposed approach, \texttt{ExpectRL}, provides better results than a classic twin critic. On robust RL benchmarks involving changes of the environment, we show that our approach is more robust than classic RL algorithms. We also introduce a variation of \texttt{ExpectRL} combined with domain randomization that is competitive with state-of-the-art robust RL agents. Finally, we extend \texttt{ExpectRL} with a mechanism for automatically choosing the expectile value, i.e., the degree of pessimism.
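The abstract describes swapping the critic's $L_2$ loss for an expectile loss. Below is a minimal, hedged sketch of what such a substitution could look like in PyTorch; the function names, the definition of the TD error as (target minus prediction), and the convention that an expectile level below 0.5 corresponds to pessimism are assumptions for illustration, not the paper's exact implementation.

```python
import torch

def expectile_loss(td_error: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """Asymmetric squared (expectile) loss: |tau - 1{u < 0}| * u^2.

    With tau = 0.5 this reduces to a scaled L2 loss. Under the assumed
    convention td_error = target - q_pred, choosing tau < 0.5 fits a lower
    expectile of the target distribution, i.e. a pessimistic value estimate.
    """
    weight = torch.abs(tau - (td_error < 0).float())
    return (weight * td_error.pow(2)).mean()

# Hypothetical usage inside a critic update (q_pred, reward, gamma,
# q_next, done are placeholder tensors standing in for a real agent):
q_pred = torch.randn(32)
reward = torch.randn(32)
q_next = torch.randn(32)
done = torch.zeros(32)
gamma = 0.99

target = reward + gamma * (1.0 - done) * q_next
loss = expectile_loss(target - q_pred, tau=0.3)  # tau < 0.5: pessimistic
```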
Supplementary Material: zip
Submission Number: 30