Regret Minimization for Partially Observable Deep Reinforcement Learning

Peter H. Jin, Sergey Levine, Kurt Keutzer

05 Feb 2018 (modified: 14 Oct 2024) · ICLR 2018 Workshop Submission
Keywords: deep reinforcement learning
TL;DR: Advantage-based regret minimization is a new deep reinforcement learning algorithm that is particularly effective on partially observable tasks, such as first-person navigation in Doom and Minecraft.
Abstract: Deep reinforcement learning algorithms that estimate state and state-action value functions have been shown to be effective in a variety of challenging domains, including learning control strategies from raw image pixels. However, such algorithms typically assume a fully observed state and must compensate for partial or non-Markovian observations with finite-length frame-history observations or recurrent networks. In this work, we propose a new deep reinforcement learning algorithm based on counterfactual regret minimization that iteratively updates an approximation to a cumulative clipped advantage function and is robust to partially observed state. We demonstrate that on several partially observed reinforcement learning tasks, this new class of algorithms can substantially outperform strong baseline methods on Pong with single-frame observations and on the challenging Doom (ViZDoom) and Minecraft (Malmö) first-person navigation benchmarks.
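
The abstract's phrase "cumulative clipped advantage function" suggests a regret-matching-style update in the spirit of CFR+: accumulate the current advantage estimate, clip at zero, and act with probabilities proportional to the positive accumulated values. Below is a minimal tabular sketch under that reading; the names (`cum_clipped_adv`, `update`, `policy_from_advantage`) and the NumPy setting are illustrative assumptions, not the paper's implementation, which fits Q and V with deep networks from raw observations.

```python
import numpy as np

# Tabular sketch of a regret-matching-style update over a cumulative clipped
# advantage. This is an assumption-laden illustration of the idea named in the
# abstract, not the authors' exact deep RL algorithm.

n_states, n_actions = 4, 3
cum_clipped_adv = np.zeros((n_states, n_actions))  # running clipped advantage A+

def policy_from_advantage(adv_plus, s):
    """Regret-matching-style policy: action probabilities proportional to the
    positive cumulative advantage; uniform if all entries are zero."""
    pos = np.maximum(adv_plus[s], 0.0)
    total = pos.sum()
    return pos / total if total > 0 else np.full(n_actions, 1.0 / n_actions)

def update(q_t, v_t):
    """Accumulate the current advantage estimate A_t(s,a) = Q_t(s,a) - V_t(s)
    and clip the running sum at zero (CFR+-style)."""
    global cum_clipped_adv
    adv_t = q_t - v_t[:, None]
    cum_clipped_adv = np.maximum(cum_clipped_adv + adv_t, 0.0)

# One illustrative iteration with made-up value estimates.
q_t = np.random.randn(n_states, n_actions)
v_t = q_t.mean(axis=1)
update(q_t, v_t)
print(policy_from_advantage(cum_clipped_adv, s=0))
```

Because the policy depends only on accumulated advantage estimates rather than a greedy argmax over a single Q estimate, this style of update does not require the observation to be a Markov state, which is consistent with the robustness to partial observability claimed in the abstract.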
Community Implementations: [1 code implementation](https://www.catalyzex.com/paper/regret-minimization-for-partially-observable/code)