Decision-Aware Model Learning for Actor-Critic Methods: When Theory Does Not Meet Practice

Published: 09 Dec 2020 · Last Modified: 05 May 2023 · ICBINB 2020 Spotlight
Keywords: reinforcement learning, model-based, actor-critic, decision aware, optimization
TL;DR: We adapted the ideas of value-aware model learning (VAML) to the Actor-Critic framework and showed that VAML does not translate well to the deep RL setting: conventional MLE-based approaches yield comparable or better end-task performance.
Abstract: Actor-Critic methods are a prominent class of modern reinforcement learning algorithms based on the classic Policy Iteration procedure. Despite many successes, Actor-Critic methods tend to require a very large amount of experience and can be very unstable. Recent approaches have advocated learning and using a world model to improve sample efficiency and reduce reliance on the value function estimate. However, learning an accurate dynamics model of the world remains challenging, often requiring computationally costly and data-hungry models. More recent work has shown that learning an everywhere-accurate model is unnecessary and often detrimental to the overall task; instead, the agent should improve the world model in task-critical regions. For example, in Iterative Value-Aware Model Learning, the authors extend model-based value iteration by incorporating the value function (estimate) into the model loss function, showing that the new model objective translates into improved performance on the end task. Therefore, it seems natural to expect that model-based Actor-Critic methods can benefit equally from learning value-aware models, improving overall task performance or reducing the need for large, expensive models. However, we show empirically that combining Actor-Critic methods with value-aware model learning can be quite difficult and that naive approaches such as maximum likelihood estimation often achieve superior performance at lower computational cost. Our results suggest that, despite theoretical guarantees, learning a value-aware model in continuous domains does not ensure better performance on the overall task.
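For context, the value-aware model objective studied in Iterative Value-Aware Model Learning can be sketched roughly as below; the notation ($\mathcal{P}$ for the true dynamics, $\hat{\mathcal{P}}$ for the learned model, $\hat{V}$ for the current value estimate, $\mu$ for the state-action distribution) is an illustrative paraphrase rather than the exact formulation in the paper:

$$
\mathcal{L}\big(\hat{\mathcal{P}};\, \hat{V}, \mu\big)
\;=\;
\mathbb{E}_{(s,a)\sim\mu}\!\left[
\Big(
\mathbb{E}_{s'\sim\mathcal{P}(\cdot\mid s,a)}\big[\hat{V}(s')\big]
\;-\;
\mathbb{E}_{s'\sim\hat{\mathcal{P}}(\cdot\mid s,a)}\big[\hat{V}(s')\big]
\Big)^{2}
\right].
$$

In contrast, a conventional maximum-likelihood model is fit by minimizing the negative log-likelihood of observed transitions, ignoring $\hat{V}$ entirely; the paper's empirical finding is that this value-agnostic baseline often matches or exceeds the value-aware objective when paired with Actor-Critic methods.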
